Abstract

To create a pose-invariant face recognizer, one strategy is the view-based approach, which uses a set of
example views at different poses. But what if we only have one example view available, such as a scanned
passport photo - can we still recognize faces under different poses? Given one example view at a known
pose, it is still possible to use the view-based approach by exploiting prior knowledge of faces to generate
virtual views, or views of the face as seen from different poses. To represent prior knowledge, we use 2D
example views of prototype faces under different rotations. We will develop example-based techniques for
applying the rotation seen in the prototypes to essentially “rotate” the single real view which is available.
Next, the combined set of one real and multiple virtual views is used as example views in a view-based,
pose-invariant face recognizer. Our experiments suggest that for expressing prior knowledge of faces, 2D
example-based approaches should be considered alongside the more standard 3D modeling techniques.
1 Introduction

Existing work in face recognition has demonstrated good recognition
performance on frontal, expressionless views of faces with controlled
lighting (see Baron [4], Turk and Pentland [48], Bichsel [11], Brunelli and
Poggio [14], and Gilbert and Yang [20]). One of the key remaining problems
in face recognition is to handle the variability in appearance due to
changes in pose, expression, and lighting conditions. There has been some
recent work in this direction, such as pose-invariant recognizers (Pentland,
et al. [34], Beymer [10]) and deformable template approaches (Manjunath,
et al. [30]). In addition to recognition, richer models for faces have been
studied for analyzing varying illumination (Hallinan [22]) and expression
(Yacoob and Davis [55], Essa and Pentland [19]).

In this paper, we address the problem of recognizing faces under varying
pose when only one example view per person is available. For example,
perhaps just a driver’s license photograph is available for each person in
the database. If we wish to recognize new images of these people under a
range of viewing directions, some of the new images will differ from the
single view by a rotation in depth. Is recognition still possible?

There are a few potential approaches to the problem of face recognition from
one example view. For example, the invariant features approach records
features in the example view that do not change as pose-expression-lighting
parameters change, features such as color or geometric invariants. While not
yet applied to face recognition, this approach has been used for face
detection under varying illumination (Sinha [45]) and for indexing of
packaged grocery items using color (Swain and Ballard [46]).

In the flexible matching approach (von der Malsburg and collaborators [30]
[25]), the input image is deformed in 2D to match the example view. In [30],
the deformation is driven by a matching of local “end-stop” features so that
the resulting transformation between model and input is like a 2D warp
rather than a global, rigid transform. This enables the deformation to match
input and model views even though they may differ in expression or
out-of-plane rotations. A deformation matching the input with a model view
is evaluated by a cost functional that measures both the similarity of
matched features and the geometrical distortion induced by the deformation.
In this method, the difficulties include (a) constructing a generally valid
cost functional, and (b) the computational expense of a non-convex
optimization problem at run-time. However, since this matching mechanism is
quite general (it does not take into consideration any prior model of human
facial expression or 3D structure), it may be used for a variety of objects.

Generic 3D models of the human face can be used to predict the appearance of
a face under different pose-expression-lighting parameters. For synthesizing
images of faces, 3D facial models have been explored in the computer
graphics, computer vision, and model-based image coding communities
(Aitchison and Craw [1], Kang, Chen, and Hsu [24], Essa and Pentland [19],
Akimoto, Suenaga, and Wallace [3], Waters and Terzopoulos [47], Aizawa,
Harashima, and Saito [2]). In the 3D technique, face shape is represented
either by a polygonal model or by a more complicated multilayer mesh that
simulates tissue. Once a 2D face image is texture mapped onto the 3D model,
the face can be treated as a traditional 3D object in computer graphics,
undergoing 3D rotations or changes in light source position. Faces are
texture mapped onto the 3D model either by specifying corresponding facial
features in both the image and 3D model or by recording both 3D depth and
color image data simultaneously by using specialized equipment like the
Cyberware scanner. Prior knowledge for expression has been added to the 3D
model by embedding muscle forces that deform the 3D model in a way that
mimics human facial muscles.

A generic 3D model could also be applied to our scenario of pose-invariant
face recognition from one example view. The single view of each person could
be texture mapped onto a 3D model, and then the 3D model could be rotated to
novel poses. Applying this strategy to face recognition, to our knowledge,
has not yet been explored.

While 3D models are one method for using prior knowledge of faces to
synthesize new views from just one view, in this paper we investigate
representing this prior face knowledge in an example-based manner, using 2D
views of prototype faces. Since we address the problem of recognition under
varying pose, the views of prototype faces will sample different rotations
out of the image plane. In principle, though, different expressions and
lightings can be modeled by sampling the prototype views under those
parameters. Given one view of a person, we will propose two methods for
using the information in the prototype views to synthesize new views of the
person, views from different rotations in our case. Following Poggio and
Vetter [39], we call these synthesized views virtual views.

Our motivation for using the example-based approach is its potential for
being a simple alternative to the more complicated 3D model-based approach.
Using an example-based approach to bypass 3D models for 3D object
recognition was first explored in the linear combinations approach to
recognition (Ullman and Basri [49], Poggio [35]). In linear combinations,
one can show that a 2D view of an object under rigid 3D transformation can
be written as a linear combination of a small set of 2D example views, where
the 2D view representation is a vector of (x, y) locations of a set of
feature points. This is valid for a range of viewpoints in which a number of
feature points are visible in all views and thus can be brought into
correspondence for the view representation. This suggests an object may be
represented using a set of 2D views instead of a 3D model.

Poggio and Vetter [39] have discussed this linear combinations approach in
the case where only one example view is available for an object, laying the
groundwork for virtual views. Normally, with just one view, 3D recognition
is not possible. However, any method for generating additional object views
would enable a recognition system to use the linear combinations approach.
This motivated Poggio and Vetter to introduce the idea of using prior
knowledge of object class to generate virtual views. Two types of prior
knowledge were explored: knowledge of 3D object symmetry and example images
of prototypical objects of the same class. In the former, the mirror
reflection of the single example can be generated, and the latter leads to
the idea of linear classes, which we will explain and use later in this
paper.

In this paper, after discussing methods for generating virtual views, we
evaluate their usefulness in a view-based, pose-invariant face recognizer.
Given only one real example view per person, we will synthesize a set of
rotated virtual views, views that cover up/down and left/right rotations.
The combined set of one real and multiple virtual views will be used as
example views in a view-based face recognizer. Recognition performance will
be reported on a separate test set of faces that cover a range of rotations
both in and out of the image plane.

Independent from our work, Lando and Edelman [26] have recently investigated
the same overall question - generalization from a single view in face
recognition - using a similar example-based technique for representing prior
knowledge of faces. In addition, Maurer and von der Malsburg [31] have
investigated a technique for transforming their “jet” features across
rotations in depth. Their technique is more 3D than ours, as it uses a local
planarity assumption and knowledge of local surface normals.

2  Vectorized  image  representation 

Our example-based techniques for generating virtual views use a vectorized
face representation, which is an ordered vector of image measurements taken
at a set of facial feature points. These features can run the gamut from
sparse features with semantic meaning, such as the corners of the eyes and
mouth, to pixel level features that are defined by the local grey level
structure of the image. By an ordered vector, we mean that the facial
features have been enumerated $f_1, f_2, \ldots, f_n$, and that the vector
representation first contains measurements from $f_1$, then $f_2$, etc. The
measurements at a given feature will include its (x, y) location - a measure
of face “shape” - and local image color or intensity - a measure of face
“texture”. The key part of this vectorized representation is that the facial
features $f_1, f_2, \ldots, f_n$ are effectively put into correspondence
across the face images being “vectorized”. For example, if $f_1$ is the
outer corner of the left eye, then the first three elements of our vector
representation will refer to the $(x_1, y_1, \text{intensity-patch}(x_1, y_1))$
measurements of that feature point for any face being vectorized.

2.1  Shape 

Given the locations of features $f_1, f_2, \ldots, f_n$, shape is
represented by a vector $\mathbf{y}$ of length $2n$ consisting of the
concatenation of the x and y coordinate values

$$ \mathbf{y} = (x_1, y_1, x_2, y_2, \ldots, x_n, y_n)^T. $$

In our notation, if an image being vectorized has an identifying subscript
(e.g. $i_a$), then the vector $\mathbf{y}$ will carry the same subscript,
$\mathbf{y}_a$. The coordinate system used for measuring x and y will be one
normalized by using the eye locations to fix interocular distance and remove
head tilt. By factoring out the 2D aspects of pose, the remaining
variability in shape vectors will be caused by expressions, rotations out of
the image plane, and the natural variation in the configuration of features
seen across people.
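
The normalization and concatenation described above can be sketched as
follows. This is only an illustration under stated assumptions: the function
names are ours, the eyes are mapped to (0, 0) and (interocular, 0) by a
similarity transform, and the interleaved ordering of coordinates is assumed
from the description of the vector layout in this section.

```python
import math

def normalize_by_eyes(points, left_eye, right_eye, interocular=1.0):
    """Similarity-transform points so the eyes land at fixed positions.

    After the transform the left eye sits at (0, 0) and the right eye at
    (interocular, 0), which fixes interocular distance and removes head
    tilt. The target positions are an assumption of this sketch.
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    dist = math.hypot(dx, dy)
    scale = interocular / dist
    cos_t, sin_t = dx / dist, dy / dist  # direction of the eye line
    out = []
    for (x, y) in points:
        tx, ty = x - left_eye[0], y - left_eye[1]
        # Rotate by minus the tilt angle, then scale.
        nx = scale * (cos_t * tx + sin_t * ty)
        ny = scale * (-sin_t * tx + cos_t * ty)
        out.append((nx, ny))
    return out

def shape_vector(points):
    """Concatenate feature locations as (x1, y1, x2, y2, ..., xn, yn)."""
    vec = []
    for (x, y) in points:
        vec.extend([x, y])
    return vec
```

For n feature points the resulting vector has length 2n, and the in-plane
(2D) aspects of pose no longer contribute to differences between vectors.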

This vectorized representation for 2D shape has been widely used, including
network-based object recognition (Poggio and Edelman [37]), the linear
combinations approach to recognition (Ullman and Basri [49], Poggio [35]),
active shape models (Cootes and Taylor [15], Cootes, et al. [16]) and face
recognition (Craw and Cameron [17][18]). In these shape vectors, a sparse
set of feature points, on the order of tens of features, are either manually
placed on the object or located using a feature finder. For a face, example
feature points may include the inner and outer corners of the eyes, the
corners of the mouth, and points along the eyebrows and sides of the face.

In this paper we use a dense representation of one feature per pixel, a
representation originally suggested to us by the object recognition work of
Shashua [43]. Compared to a sparser representation, the pixelwise
representation increases the difficulty of finding correspondences. However,
we have found that a standard optical flow algorithm [7], preceded by
normalization based on the eye locations, can do a good job at automatically
computing dense pixelwise correspondences. After defining one image as a
“reference” image, the (x, y) locations of feature points of a new image are
computed by finding optical flow between the two images. Thus the shape
vector of the new image, really a “relative” shape, is described by a flow
or a vector field of correspondences relative to a standard reference shape.
Our face vectorizer (see Beymer [9]), which uses optical flow as a
subroutine, is also used to automatically compute the vectorized
representation.

Optical flow matches features in the two frames using the local grey level
structure of the images. As opposed to a feature finder, where the
“semantics” of features is determined in advance by the particular set of
features sought by the feature finder, the reference image provides shape
“semantics” in the relative representation. For example, to find the corner
of the left eye in a relative shape, one follows the vector field starting
from the left eye corner pixel in the reference image.

Correspondence with respect to a reference shape, as computed by optical
flow, can be expressed in our vector notation as the difference between two
vectorized shapes. Let us choose a face shape $\mathbf{y}_{std}$ to be the
reference. Then the shape of an arbitrary face $\mathbf{y}_a$ is represented
by the geometrical difference $\mathbf{y}_a - \mathbf{y}_{std}$, which we
shall abbreviate $\mathbf{y}_{a-std}$. This is still a vector of length
$2n$, but now it is a vector field of correspondences between images $i_a$
and $i_{std}$. In addition, we keep track of the reference frame by using a
superscript, so we add the superscript std to the shape, giving
$\mathbf{y}^{std}_{a-std}$. The utility of keeping track of the reference
image will become more apparent when describing operations on shapes. Fig. 1
shows the shape representation $\mathbf{y}^{std}_{a-std}$ for the image
$i_a$. As indicated by the grey arrow, correspondences are measured relative
to the reference face $i_{std}$ at standard shape. This relative shape
representation has been used by Beymer, Shashua, and Poggio [8] in an
example-based approach to image analysis and synthesis.

Figure 1: Our vectorized representation for image $i_a$ with respect to the
reference image $i_{std}$ at standard shape. First, pixelwise correspondence
is computed between $i_{std}$ and $i_a$, as indicated by the grey arrow.
Shape $\mathbf{y}^{std}_{a-std}$ is a vector field that specifies a
corresponding pixel in $i_a$ for each pixel in $i_{std}$. Texture consists
of the grey levels of $i_a$ mapped onto the standard shape. In this figure,
image $i_{std}$ is the mean grey level image of 55 example faces that have
been warped to the standard reference shape $\mathbf{y}_{std}$.
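
A minimal sketch of the relative shape representation, assuming shape
vectors stored as flat Python lists in the interleaved (x, y) order; the
function names and the list layout are ours and are not taken from the
vectorizer referenced in this section.

```python
def relative_shape(y_a, y_std):
    """Return the flow field y_a - y_std (both vectors have length 2n)."""
    if len(y_a) != len(y_std):
        raise ValueError("shape vectors must have the same length")
    return [a - s for a, s in zip(y_a, y_std)]

def locate_feature(y_std, flow, k):
    """Find feature k in image i_a by following the flow vector that
    starts at feature k of the reference shape."""
    x = y_std[2 * k] + flow[2 * k]
    y = y_std[2 * k + 1] + flow[2 * k + 1]
    return (x, y)
```

Here the reference shape supplies the “semantics”: feature k means whatever
sits at index k of the reference, and the flow says where that point went.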

2.2  Texture 

Our texture vector is a geometrically normalized version of the image $i_a$.
That is, the geometrical differences among face images are factored out by
warping the images to a common reference shape. This strategy for
representing texture has been used, for example, in the face recognition
works of Craw and Cameron [17], and Shackleton and Welsh [42]. If we let
shape $\mathbf{y}_{std}$ be the reference shape, then the geometrically
normalized image $\mathbf{t}_a$ is given by the 2D warp

$$ \mathbf{t}_a(x, y) = i_a\bigl(x + \Delta x^{std}_{a-std}(x, y),\; y + \Delta y^{std}_{a-std}(x, y)\bigr), $$

where $\Delta x^{std}_{a-std}$ and $\Delta y^{std}_{a-std}$ are the x and y
components of $\mathbf{y}^{std}_{a-std}$, the pixelwise mapping between
$i_a$ and the standard shape $\mathbf{y}_{std}$. Fig. 1 in the lower right
shows an example texture vector $\mathbf{t}_a$ for the input image $i_a$ in
the upper right.
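
The 2D warp above is a backward warp: each pixel of the normalized texture
looks up a grey level in the input image. A minimal sketch follows, assuming
images and flow components stored as lists of rows, nearest-neighbour
sampling, and a fill value of 0 for samples that fall outside the image.
These choices are ours; a practical implementation would likely interpolate.

```python
def warp_to_reference(image, flow_dx, flow_dy):
    """Geometrically normalize `image` (a list of rows of grey levels).

    Computes t(x, y) = image(x + dx(x, y), y + dy(x, y)), where
    flow_dx[y][x] and flow_dy[y][x] give the displacement from reference
    pixel (x, y) to its corresponding point in `image`.
    """
    height, width = len(image), len(image[0])
    texture = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Nearest-neighbour sampling (an assumption of this sketch).
            sx = int(round(x + flow_dx[y][x]))
            sy = int(round(y + flow_dy[y][x]))
            if 0 <= sx < width and 0 <= sy < height:
                texture[y][x] = image[sy][sx]
    return texture
```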

If shape is sparsely defined, then texture mapping or sparse data
interpolation techniques can be employed to create the necessary pixelwise
level representation. Example sparse data interpolation techniques include
using splines (Litwinowicz and Williams [28], Wolberg [54]), radial basis
functions (Reisfeld, Arad, and Yeshurun [40]), and inverse weighted distance
metrics (Beier and Neely [5]). If a pixelwise representation is being used
for shape in the first place, such as one derived from optical flow, then
texture mapping or data interpolation techniques can be avoided.

For our vectorized representation, we have chosen a dense, pixelwise set of
features. What are some of the tradeoffs with respect to a sparser set of
features? Texture processing is simplified over the sparse case since we
avoid texture mapping and sparse data interpolation techniques, instead
employing a simple 2D warping algorithm. Additionally, though, using a
pixelwise representation makes the vectorized representation very simple
conceptually: we can think of three measurements being made per feature,
(x, y, intensity(x, y)). The price we pay for this simplicity is a difficult
correspondence problem. In section 5 we describe three correspondence
techniques we explored for computing the vectorized image representation.

3  Prior  knowledge  of  object  class: 
prototype  views 

In our example-based approach for generating virtual views, prior knowledge
of face transformations such as changes in rotation or expression is
represented by 2D views of prototypical faces. Let there be $N$ prototype
faces $p_j$, $1 \le j \le N$, where the prototypes are chosen to be
representative of the variation in the class of faces. Unlike non-prototype
faces - for which we only have a single example view - many views are
available for each prototype $p_j$.

Given a single real view of a novel face at a known pose, we wish to
transform the face to produce a rotated virtual view. Call the known pose of
the real view the standard pose and the pose of the desired virtual view the
virtual pose. Images of the prototype faces are then collected for both the
standard and virtual poses. As shown in Fig. 2, let

$i_{p_j}$ = set of N prototype views at standard pose,
$i_{p_j,r}$ = set of N prototype views at virtual pose,

where $1 \le j \le N$. Since we wish to synthesize many virtual views from
the same standard pose, sets of prototype views at the virtual pose will be
acquired for all the desired virtual views.

The techniques we explore for generating virtual views work with the
vectorized image representation introduced in the previous section. That is,
the prototype images $i_{p_j}$ and $i_{p_j,r}$ have been vectorized,
producing shape vectors $\mathbf{y}_{p_j}$ and $\mathbf{y}_{p_j,r}$ and
texture vectors $\mathbf{t}_{p_j}$ and $\mathbf{t}_{p_j,r}$. The specific
techniques we used to vectorize images will be discussed in section 5.

In the vectorized image representation, a set of images are brought into
correspondence by locating a common set of feature points across all images.
Since the set of prototype views contains a variety of both people and
viewpoints, our definition of the vectorized representation implies that
correspondence needs to be computed across different viewpoints as well as
different people. However, the two techniques for generating virtual views,
parallel deformation and linear classes, have different requirements in
terms of correspondence across viewpoint. Parallel deformation requires
these correspondences, so the prototype views are vectorized as one large
set. On the other hand, linear classes does not require correspondence
across viewpoints, so the set of images is partitioned by viewpoint and
separate vectorizations defined for each viewpoint. In this latter case,
vectorization is simply handling correspondence across the different
prototypes at a fixed pose.

Figure 2: To represent prior knowledge of a facial transform (rotation
upwards in the figure), views of N prototype faces are collected at the
standard pose ($i_{p_j}$) and virtual pose ($i_{p_j,r}$), $1 \le j \le N$.

4  Virtual  views  synthesis  techniques 

In  this  section  we  explore  two  techniques  for  generating 
virtual  views  of  a  “novel”  face  for  which  just  one  view  is 
available  at  standard  pose: 

1. Linear classes. Using multiple prototype objects, first write the novel
face as a linear combination of prototypes at the standard pose, yielding a
set of linear prototype coefficients. Then synthesize the novel face at the
virtual pose by taking the linear combination of prototype objects at the
virtual pose using the same set of coefficients. Using this approach, as
discussed in Poggio [36] and Poggio and Vetter [39], it is possible to
“learn” a direct mapping from standard pose to a particular virtual pose.

2. Parallel deformation. Using just one prototype object, measure the 2D
deformation of object features going from the standard to virtual view. Then
map this 2D deformation onto the novel object and use the deformation to
distort, or warp, the novel image from the standard pose to the virtual one.
The technique has been explored previously by Brunelli and Poggio [38]
within the context of an “example-based” approach to computer graphics and
by researchers in performance-driven animation (Williams [52][53],
Patterson, Litwinowicz, and Greene [33]).

For notation, let $i_n$ be the single real view of the novel face in
standard pose. The virtual view will be denoted $i_{n,r}$.
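
As a rough illustration of the parallel deformation idea in item 2, the
sketch below transfers a prototype's shape deformation to the novel face by
plain vector arithmetic. It assumes that all three shape vectors share the
same feature ordering, so the deformation can be added index by index. The
function name is ours, and the step that maps the deformation onto the novel
face may be more involved in practice than this addition.

```python
def parallel_deformation_shape(y_n, y_p, y_p_r):
    """Estimate the novel shape at the virtual pose.

    y_n   : novel face shape at the standard pose
    y_p   : prototype shape at the standard pose
    y_p_r : prototype shape at the virtual pose
    Assumes all vectors index the same features in the same order.
    """
    if not (len(y_n) == len(y_p) == len(y_p_r)):
        raise ValueError("shape vectors must have the same length")
    # Prototype deformation from standard to virtual pose, added to y_n.
    return [n + (pr - p) for n, p, pr in zip(y_n, y_p, y_p_r)]
```

The resulting shape would then drive a 2D warp of the novel image toward the
virtual pose.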

4.1  Linear  Classes 

Because the theory of linear classes begins with a modeling assumption in
3D, let us generalize the 2D vectorized image representation to a 3D object
representation. Recall that the 2D image vectorization is based on
establishing feature correspondence across a set of 2D images. In 3D, this
simply becomes finding a set of corresponding 3D points for a set of
objects. The feature points are distributed over the face in 3D and thus may
not all be visible from any one single view. Two measurements are made at
each 3D feature point:

1. Shape. The $(x, y, z)$ coordinates of the feature point. If there are $n$
feature points, the vector $\mathbf{Y}$ will be a vector of length $3n$
consisting of the x, y, and z coordinate values.

2. Texture. If we assume that the 3D object is Lambertian and fix the
lighting direction $\mathbf{l} = (l_x, l_y, l_z)$, we can measure the
intensity of light reflected from each feature point, independent of
viewpoint. At the $i$th feature point, the intensity $\mathbf{T}[i]$ is
given by

$$ \mathbf{T}[i] = \rho[i]\,(\eta[i] \cdot \mathbf{l}), \qquad (1) $$

where $\rho[i]$ is the albedo, or local surface reflectance, of feature $i$
and $\eta[i]$ is the local surface normal at feature $i$.

The texture vector $\mathbf{T}$ is not an image; one can think of it as a
texture that is mapped onto the 3D shape $\mathbf{Y}$ given a particular set
of lighting conditions $\mathbf{l}$. One helpful way to visualize the
texture vector $\mathbf{T}$ is as a sampling of image intensities in a
cylindrical coordinate system that covers feature points over the entire
face. This is similar to that produced by the Cyberware scanner.
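
Equation (1) can be evaluated per feature point as in the sketch below. The
data layout (one albedo value and one normal triple per feature) and the
function name are assumptions made for illustration. Following equation (1)
as written, no clamping is applied to points that face away from the light.

```python
def lambertian_texture(albedo, normals, light):
    """Return T[i] = albedo[i] * dot(normals[i], light) for each feature.

    albedo  : list of n reflectance values
    normals : list of n (nx, ny, nz) surface normals
    light   : fixed lighting direction (lx, ly, lz)
    """
    texture = []
    for rho, eta in zip(albedo, normals):
        shading = sum(e * l for e, l in zip(eta, light))
        texture.append(rho * shading)
    return texture
```

Because the light is fixed and no viewing direction appears in the formula,
the same values are obtained whatever viewpoint the face is later seen from.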

Consider the relationship between 3D vectorized shape $\mathbf{Y}$ and
texture $\mathbf{T}$ and their counterpart 2D versions $\mathbf{y}$ and
$\mathbf{t}$. The projection process of going from 3D shape $\mathbf{Y}$ to
2D shape $\mathbf{y}$ consists of a 3D rotation, occlusion of a set of
non-visible feature points, and orthographic projection. Mathematically, we
model this using a matrix $L$

$$ \mathbf{y} = L\mathbf{Y}, \qquad (2) $$

where matrix $L$ is the product of a 3D rotation matrix $R$, an occlusion
matrix $D$ that simply drops the coordinates of the occluded points, and
orthographic projection $O$

$$ L = ODR. $$

Note that $L$ is a linear projection operator.

Footnote: Vetter and Poggio [51] have explored the implications of 3D linear
combinations of shape on image grey levels, or texture. If an object is a
linear combination of prototype objects in 3D, then so are the surface
normals. Thus, under Lambertian shading with constant albedo over each
object, the grey level image of the novel object should be the same linear
combination of the grey level prototype images.

Also, it is not strictly necessary for the object to be Lambertian; equation
(1) could be a different functional form of $\rho$, $\eta$, and
$\mathbf{l}$. What is necessary is that $\mathbf{T}[i]$ is independent of
lighting and viewing direction, which may be achieved by fixing the light
source and assuming that the object is Lambertian.
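
The three factors of $L$ can be sketched directly. In this illustration,
which rests on our own assumptions, $R$ is a rotation about the vertical
axis only, the visibility flags that define $D$ are supplied by the caller
rather than derived from the geometry, and the output uses the interleaved
(x, y) ordering.

```python
import math

def project_shape(points3d, yaw, visible):
    """Apply y = O D R Y to a list of (x, y, z) feature points.

    R: rotation by `yaw` radians about the vertical (y) axis.
    D: keep only the points whose flag in `visible` is True.
    O: orthographic projection, i.e. discard the rotated z coordinate.
    """
    c, s = math.cos(yaw), math.sin(yaw)
    shape2d = []
    for (x, y, z), keep in zip(points3d, visible):
        if not keep:
            continue  # D drops the coordinates of occluded points
        xr = c * x + s * z  # x component after rotation R
        yr = y              # rotation about the y axis leaves y unchanged
        shape2d.extend([xr, yr])  # O keeps x and y only
    return shape2d
```

Each step is a matrix acting on the coordinates, so for a fixed pose and
visibility pattern the whole mapping is linear in the 3D shape.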

Creating a 2D texture vector $\mathbf{t}$ at a particular viewpoint $v$
involves in some sense “projecting” the 3D texture $\mathbf{T}$. This is
done by selecting the feature points that are visible in the standard shape
at viewpoint $v$

$$ \mathbf{t} = D\mathbf{T}, \qquad (3) $$

where $D$ is a matrix that drops points occluded in the given viewpoint.
Thus, viewpoint is handled in $D$; the lighting conditions are fixed in
$\mathbf{T}$. Like operator $L$, $D$ is a linear operator.

The idea of linear classes is based on the assumption that the space of 3D
object vectorizations for objects of a given class is linearly spanned by a
set of prototype vectorizations. That is, the shape $\mathbf{Y}$ and texture
$\mathbf{T}$ of a class member can be written as

$$ \mathbf{Y} = \sum_{j=1}^{N} \alpha_j \mathbf{Y}_{p_j} \quad \text{and} \quad \mathbf{T} = \sum_{j=1}^{N} \beta_j \mathbf{T}_{p_j} \qquad (4) $$

for some set of $\alpha_j$ and $\beta_j$ coefficients.

While the virtual views methods based on linear classes do not actually
compute the 3D vectorized representation, the real view $i_n$ is related to
the destination virtual view $i_{n,r}$ through the 3D vectorization of the
novel object. First, a 2D image analysis of $i_n$ at standard pose estimates
the $\alpha_j$ and $\beta_j$ in equation (4) by using the prototype views
$i_{p_j}$. Then the virtual view $i_{n,r}$ can be synthesized using the
linear coefficients and the prototype views $i_{p_j,r}$. Let us now examine
these steps in detail for the shape and texture of the novel face.

4.1.1  Virtual  shape 

Given the vectorized shape of the novel person $\mathbf{y}_n$ and the
prototype vectorizations $\mathbf{y}_{p_j}$ and $\mathbf{y}_{p_j,r}$,
$1 \le j \le N$, linear classes can be used to synthesize vectorized shape
$\mathbf{y}_{n,r}$ at the virtual pose. This idea was first developed by
Poggio and Vetter [39].

In linear classes, we assume that the novel 3D shape $\mathbf{Y}_n$ can be
written as a linear combination of the prototype shapes $\mathbf{Y}_{p_j}$

$$ \mathbf{Y}_n = \sum_{j=1}^{N} \alpha_j \mathbf{Y}_{p_j}. \qquad (5) $$

If the linear class assumption holds and the set of 2D views
$\mathbf{y}_{p_j}$ are linearly independent, then we can solve for the
$\alpha_j$'s at the standard view

$$ \mathbf{y}_n = \sum_{j=1}^{N} \alpha_j \mathbf{y}_{p_j} \qquad (6) $$

and use the prototype coefficients $\alpha_j$ to synthesize the virtual
shape

$$ \mathbf{y}_{n,r} = \sum_{j=1}^{N} \alpha_j \mathbf{y}_{p_j,r}. \qquad (7) $$

This is true under orthographic projection. The mathematical details are
provided in Appendix B.


While this may seem to imply that we can perform a 3D analysis based on one
2D view of an object, the linear class assumption cannot be verified using
2D views. Thus, from just the 2D analysis, the technique can be “fooled”
into thinking that it has found a good set of linear coefficients when in
fact equation (5) is poorly approximated. That is, the technique will be
fooled when the actual 3D shape of the novel person is different from the 3D
interpolated prototype shape in the right hand side of equation (5).

In solving equations (6) and (7), the linear class approach can be
interpreted as creating a direct mapping from standard to virtual pose. That
is, we can derive a function that maps from $\mathbf{y}$'s in standard pose
to $\mathbf{y}$'s in the virtual pose. Let $Y$ be a matrix where column $j$
is $\mathbf{y}_{p_j}$, and let $Y_r$ be a matrix where column $j$ is
$\mathbf{y}_{p_j,r}$. Then if we solve for equation (6) using linear least
squares and plug the resulting $\alpha$'s into equation (7), then

$$ \mathbf{y}_{n,r} = Y_r Y^{\dagger} \mathbf{y}_n, \qquad (8) $$

where $Y^{\dagger}$ is the pseudoinverse $(Y^T Y)^{-1} Y^T$.
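
The two steps, least-squares analysis at the standard pose and
reconstruction at the virtual pose, can be sketched in plain Python. This is
an illustrative implementation under our own assumptions: prototypes are
given as lists of shape vectors, the normal equations are solved by
Gauss-Jordan elimination, and all function names are hypothetical. A
numerical library would normally replace the hand-written solver.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("prototype shapes are not linearly independent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def linear_class_coefficients(y_n, protos_std):
    """Least-squares alphas with y_n ~ sum_j alpha_j * protos_std[j]."""
    # Normal equations: (Y^T Y) alpha = Y^T y_n, with prototypes as columns.
    gram = [[sum(p * q for p, q in zip(pi, pj)) for pj in protos_std]
            for pi in protos_std]
    rhs = [sum(p * v for p, v in zip(pi, y_n)) for pi in protos_std]
    return solve(gram, rhs)

def synthesize_virtual_shape(y_n, protos_std, protos_virt):
    """Recombine the virtual-pose prototypes with the same alphas."""
    alphas = linear_class_coefficients(y_n, protos_std)
    length = len(protos_virt[0])
    return [sum(a * p[k] for a, p in zip(alphas, protos_virt))
            for k in range(length)]
```

Solving the normal equations is equivalent to applying the pseudoinverse in
equation (8) when the standard-pose prototypes are linearly independent; the
solver raises an error otherwise.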

Another way to formulate the solution as a direct mapping is to train a
network to learn the association between standard and virtual pose (see
Poggio and Vetter [39]). The (input, output) pairs presented to the network
during training would be the prototype pairs
$(\mathbf{y}_{p_j}, \mathbf{y}_{p_j,r})$. A potential architecture for such
a network is suggested by the fact that equation (8) can be implemented by a
single layer linear network. The weights between the input and output layers
are given simply by the matrix $Y_r Y^{\dagger}$.

4.1.2  Virtual  texture 

In addition to generating the shape component of virtual views, the
prototypes can also be used to generate the texture of virtual views. Given
the texture of a novel face $\mathbf{t}_n$ and the prototype textures
$\mathbf{t}_{p_j}$ and $\mathbf{t}_{p_j,r}$, $1 \le j \le N$, the concept of
linear classes can be used to synthesize the virtual texture
$\mathbf{t}_{n,r}$. This synthesized grey level texture is then warped or
texture mapped onto the virtual shape to create a finished virtual view. The
ideas presented in this section were developed by the authors and also
independently by Vetter and Poggio [51].

To generate the virtual texture $t_{n,r}$, we propose using
the same linear class idea of approximation at the standard
view and reconstruction at the virtual view. Similarly
to the shape case, this relies on the assumption that
the space of grey level textures $T$ is linearly spanned by
a set of prototype textures. The validity of this assumption
is borne out by recent successful face recognition
systems (e.g. eigenfaces, Pentland, et al. [34]). First,
assume that the novel texture $T_n$ can be written as a linear
combination of the prototype textures $T_{p_j}$

$T_n = \sum_{j=1}^{N} \beta_j T_{p_j}$.  (9)

The analog of linear classes for texture, presented in
Appendix B, says that if this assumption holds and the 2D
textures $t_{p_j}$ are linearly independent, then we should be
able to decompose the real texture $t_n$ in terms of the
example textures $t_{p_j}$

$t_n = \sum_{j=1}^{N} \beta_j t_{p_j}$  (10)

and use the same set of coefficients to reconstruct the
texture of the virtual view

$t_{n,r} = \sum_{j=1}^{N} \beta_j t_{p_j,r}$.  (11)


Note that the texture $T$ and hence the $\beta_j$ coefficients
are dependent on the lighting conditions. Thus, by
computing different views $t$ using the $D$ operator, we are
effectively rotating the camera around the object. The
geometry between object and light source is kept fixed.

We have synthesized textures for rotations of 10 to 15
degrees between standard and virtual poses with reasonable
results; see section 5 for example images and
section 6 for recognition experiments. In terms of
computing $t_{n,r}$ from $t_n$, we can use the same linear solution
technique as for shape (equation (8)).


4.2  Parallel  deformation 

While the linear class idea does not require the y vectors
to be in correspondence between the standard and virtual
views, if we add such “cross view” correspondence
then the linear class idea can be interpreted as finding
a 2D deformation from $y_n$ to $y_{n,r}$. Having shape vectors
in cross view correspondence simply means that the
y vectors in both poses refer to the same set of facial
feature points. The advantage of computing this 2D
deformation is that the texture of the virtual view can be
generated by texture mapping directly from the original
view $i_n$. This avoids the need for additional techniques
to synthesize virtual texture at the virtual view.

To see the deformation interpretation, subtract
equation (6) from (7) and move $y_n$ to the other side, yielding

$y_{n,r} = y_n + \sum_{j=1}^{N} \alpha_j (y_{p_j,r} - y_{p_j})$.  (12)

Bringing shape vectors from the different poses together
in the same equation is legal because we have added cross
view correspondence. The quantity
$\Delta y_{p_j} = y_{p_j,r} - y_{p_j}$ is a 2D warp that specifies
how prototype feature points move under the prototype
transformation. Equation (12) modifies the shape $y_n$ by a
linear combination of these prototype deformations. The
coefficients of this linear combination, the $\alpha_j$’s, are
given by $Y^{\dagger} y_n$, the solution to the approximation
equation (6).

Consider  as  a  special  case  the  deformation  approach 
with  just  one  prototype.  In  this  case,  the  novel  face  is 
deformed  in  a  manner  that  imitates  the  deformation  seen 
in  the  prototype.  This  is  similar  to  performance-driven 
animation  (Williams  [52]),  and  Poggio  and  Brunelli  [38], 
who  call  it  parallel  deformation,  have  suggested  it  as  a 
computer  graphics  tool  for  animating  objects  when  pro¬ 
vided  with  just  one  view.  Specializing  equation  (12) 
gives 

$y_{n,r} = y_n + (y_{p,r} - y_p)$,  (13)

where we have dropped the j subscripts on the prototype
variable p. The deformation $\Delta y = y_{p,r} - y_p$ essentially
represents the prototype transform and is the same 2D
warping as in the multiple prototypes case.
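Equations (12) and (13) can be written out as a short numerical sketch. This is an illustration under stated assumptions: the function name `parallel_deformation` and the toy feature coordinates are hypothetical, and shape vectors are assumed to be in cross view correspondence as the text requires.

```python
import numpy as np

def parallel_deformation(y_n, Y, Y_r, alpha=None):
    """Equation (12): add a weighted sum of prototype deformations to y_n.

    Columns of Y and Y_r are prototype shapes at the standard and virtual
    pose. If alpha is not given, it is taken from the least-squares
    solution of equation (6).
    """
    if alpha is None:
        alpha, *_ = np.linalg.lstsq(Y, y_n, rcond=None)
    delta = Y_r - Y                  # column j is y_{p_j,r} - y_{p_j}
    return y_n + delta @ alpha

# Single-prototype case, equation (13): the coefficient is fixed to 1.
y_p   = np.array([0.0, 0.0, 2.0, 0.0, 1.0, 2.0])   # three (x, y) points
y_p_r = np.array([0.5, 0.0, 2.2, 0.0, 1.3, 2.0])
y_n   = np.array([0.1, 0.1, 2.3, 0.0, 1.2, 2.4])
y_n_r = parallel_deformation(y_n, y_p[:, None], y_p_r[:, None],
                             alpha=np.array([1.0]))

# Multi-prototype case with a novel shape inside the prototype span.
rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 2))
Y_r = rng.normal(size=(6, 2))
coeffs = np.array([0.7, 0.3])
y_span = Y @ coeffs
y_span_r = parallel_deformation(y_span, Y, Y_r)
```

When the novel shape lies exactly in the prototype span, equation (12) gives the same answer as equation (7). The two formulations differ only in how the approximation error of equation (6) is carried into the virtual view.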


By looking at the one prototype case through specializing
the original equations (6) and (7), we get $y_n = y_p$
and $y_{n,r} = y_{p,r}$. This seems to say that the virtual
shape $y_{n,r}$ is simply that of the prototype at virtual pose,
so why should equation (13) give us anything different?
However, the specialized equations, which approximate
the novel shape by prototype shape, are likely to be poor
approximations. Thus, we should really add error terms,
writing $y_n = y_p + y_{error}$ and $y_{n,r} = y_{p,r} + y_{error,r}$.
The error terms are likely to be highly correlated, so by
subtracting the equations, as is done by parallel
deformation, we cancel out the error terms to some degree.

4.3  Comparing  linear  classes  and  parallel 
deformation 

What  are  some  of  the  relative  advantages  of  linear  classes 
and  parallel  deformation?  First,  consider  some  of  the  ad¬ 
vantages  of  linear  classes  over  parallel  deformation.  Par¬ 
allel  deformation  works  well  when  the  3D  shape  of  the 
prototype  matches  the  3D  shape  of  the  novel  person.  If 
the  two  3D  shapes  differ  enough,  the  virtual  view  gen¬ 
erated  by  parallel  deformation  will  appear  geometrically 
distorted. Linear classes, on the other hand, effectively
try to construct a prototype that matches the novel
shape  by  taking  the  proper  linear  combination  of  exam¬ 
ple  prototypes.  Another  advantage  of  linear  classes  is 
that  correspondence  is  not  required  between  standard 
and  virtual  poses.  Thus,  linear  classes  may  be  able  to 
cover  a  wider  range  of  rotations  out  of  the  image  plane 
as  compared  to  parallel  deformation. 

One  advantage  of  parallel  deformation  over  linear 
classes  is  its  ability  to  preserve  peculiarities  of  texture 
such  as  moles  or  birthmarks.  Parallel  deformation  will 
preserve  such  marks  since  it  samples  texture  from  the 
original  real  view  of  the  novel  person’s  face.  For  linear 
classes,  it  is  most  likely  that  a  random  mark  on  a  per¬ 
son’s  face  will  be  outside  the  linear  texture  space  of  the 
prototypes,  so  it  will  not  be  reconstructed  in  the  virtual 
view. 
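The point about moles and birthmarks can be illustrated with a toy projection. The sketch below assumes random stand-in "textures" and a single-pixel mark, and none of the names come from the paper. It shows that a least-squares fit onto the prototype span leaves the out-of-span part of the mark in the residual, so a synthesis step that reuses the coefficients cannot carry that part over.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pixels, n_protos = 50, 5
T = rng.normal(size=(n_pixels, n_protos))   # toy prototype textures (columns)

face = T @ rng.normal(size=n_protos)        # component inside the span
mole = np.zeros(n_pixels)
mole[7] = 1.0                               # a single-pixel "mark"
t_n = face + mole                           # novel texture with the mark

# Analysis step: least-squares coefficients of t_n on the prototypes.
beta, *_ = np.linalg.lstsq(T, t_n, rcond=None)
recon = T @ beta                            # what the linear class can express
residual = t_n - recon                      # what it cannot express

# Fraction of the mark that survives at the marked pixel.
kept = recon[7] - face[7]
```

The residual is orthogonal to every prototype texture, so only the coefficients `beta`, and not the residual, influence the synthesized virtual texture. Parallel deformation does not have this limitation because it resamples the original pixels.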

5  Generating  virtual  views 

In  our  approach  to  recognizing  faces  using  just  one  ex¬ 
ample  view  per  person,  we  first  expand  the  example  set 
by  generating  virtual  views  of  each  person’s  face.  The 
full  set  of  views  that  we  would  ultimately  like  to  have  for 
our  view-based  face  recognizer  are  the  set  of  15  example 
views  shown  in  Fig.  3  and  originally  used  in  the  view- 
based  recognizer  of  Beymer  [10].  These  views  evenly 
sample  the  two  rotation  angles  out  of  the  image  plane. 

While  Fig.  3  shows  15  real  views,  in  virtual  views  we 
assume  that  only  view  m4  is  available  and  we  synthesize 
the  remaining  14  example  views.  For  the  single  real  view, 
an  off-center  view  was  favored  over,  say,  a  frontal  view 
because  of  the  recognition  results  for  bilaterally  symmet¬ 
ric  objects  of  Poggio  and  Vetter  [39].  When  the  single 
real  view  is  from  a  nondegenerate  pose  (i.e.  mirror  re¬ 
flection  is  not  equal  to  original  view),  then  the  mirror 
reflection  immediately  provides  a  second  view  that  can 
be  used  for  recognition.  The  choice  of  an  off-center  view 
is  also  supported  by  the  psychophysical  experiments  of 
Schyns and Bülthoff [41]. They found that when humans
are trained on just one pose and tested on many, recognition
performance is better when the single training view
is an off-center one as opposed to a frontal pose.

Figure 3: The view-based face recognizer uses 15 views to model a person’s face. For virtual views, we assume that
only one real view, view m4, is available and we synthesize the remaining 14.

In  completing  the  set  of  15  example  views,  the  8  views 
neighboring  m4  will  be  generated  using  our  virtual  views 
techniques.  Using  the  terminology  of  the  theory  section, 
view m4 is the standard pose and each of the neighboring
views is a virtual pose. The remaining 6 views, the
right  two  columns  of  Fig.  3,  will  be  generated  by  assum¬ 
ing  bilateral  symmetry  of  the  face  and  taking  the  mirror 
reflection  of  the  left  two  columns. 
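The mirror-reflection step is simple to sketch. The following assumes images stored as 2D arrays indexed `[row, column]` and feature points stored as (x, y) pairs. The function names are hypothetical.

```python
import numpy as np

def mirror_view(image):
    """Reflect an image about its vertical axis (left-right flip)."""
    return image[:, ::-1]

def mirror_points(points, width):
    """Reflect (x, y) feature points for an image `width` pixels wide.

    After reflection, features that were on the left of the face are on
    the right, so any left/right labels would need to be swapped by the
    caller.
    """
    mirrored = np.array(points, dtype=float)
    mirrored[:, 0] = (width - 1) - mirrored[:, 0]
    return mirrored

image = np.arange(12).reshape(3, 4)          # toy 3x4 "image"
flipped = mirror_view(image)
pts = mirror_points([[0.0, 1.0], [2.0, 2.0]], width=4)
```

Under the bilateral symmetry assumption, a view from the left two columns of Fig. 3 processed this way stands in for the view at the mirrored pose in the right two columns.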

But  before  describing  our  implementation  of  parallel 
deformation  and  linear  classes,  we  need  to  define  some 
operators  on  shape. 

5.1  Shape  operators 

5.1.1  Vectorizing  face  images 

Computing  the  vectorized  representation  is  really  a 
feature  correspondence  problem.  The  difficulty  of  this 
correspondence  task  depends  on  the  difference  between 
the  two  image  arguments.  Finding  pixelwise  correspon¬ 
dence  between  the  images  of  two  dissimilar  people  is  in¬ 
herently  more  difficult  than  dealing  with  two  poses  of 
the  same  person,  both  situations  of  which  are  encoun¬ 
tered  in  virtual  views.  Thus,  we  have  looked  at  three 
ways  to  vectorize  faces,  a  manual  method,  optical  flow, 
and  a  new  automatic  technique  that  we  call  an  image 
vectorizer. 

The pixelwise correspondence algorithms discussed in
this section compute a relative shape $y^b_{a-b}$, i.e. the shape
$y_a$ of image $i_a$ with respect to a reference image $i_b$. This
computation will be denoted using the vect operator

$y^b_{a-b} = \mathrm{vect}(i_a, i_b)$.

Pictorially, we visualize the shape $y^b_{a-b}$ in Fig. 4 by
drawing an arrow from $i_b$ to $i_a$. Of course, given this
relative shape $y^b_{a-b}$, our original “absolute” definition of
shape $y^b_a$ can be computed by simply adding the shape $y^b_b$,
which is simply the x and y coordinate values of each
pixel in $i_b$.

Figure 4: In relative shape, $y^b_{a-b}$ denotes feature
correspondence between $i_b$ and $i_a$ using $i_b$ as a reference.

Figure 5: Manually specified line segments drive Beier
and Neely’s pixelwise correspondence technique.

The  manual  technique  is  borrowed  from  Beier  and 
Neely’s  morphing  technique  in  computer  graphics  [5].  In 
their  technique,  a  sparse  set  of  corresponding  line  seg¬ 
ment  features  manually  placed  on  images  ia  and  ii  drive 
pixelwise  correspondence  between  the  two  images  (see 
Fig.  5).  Points  on  the  line  segments  are  mapped  exactly, 
and  points  in  between  are  mapped  using  a  weighted  com¬ 
bination  of  the  displacement  fields  generated  by  the  line 
segment  correspondences.  This  is  one  method  we  used 
for computing interperson correspondence, that is,
correspondence across different people. While this technique
always works, it is manual. Ideally, we would like
something automatic, which leads us to the next two
techniques.

The  optical  flow  technique  uses  the  gradient-based, 
hierarchical  method  of  Bergen  and  Hingorani  [7]  (also  see 
Lucas and Kanade [29], Bergen, et al. [6]). Before
applying optical flow, face images are brought into rough
registration  using  the  eyes,  which  were  located  manu¬ 
ally.  Optical  flow  is  useful  for  computing  correspondence 
among  different  rotated  views  of  the  same  prototype.  It 
works  for  interperson  correspondence  when  the  two  peo¬ 
ple  are  similar  enough  in  grey-level  appearance,  but  this 
does  not  happen  frequently  enough  to  be  useful. 

Finally, our image vectorizer is a new method for
computing pixelwise correspondence between an input
and an “average” face shape $y_{std}$. Beymer [9] provides
the details; here we only set the problem up and sketch
the solution. To model grey level face texture, the
vectorizer uses a set of N shape free prototypes $t_{p_j}$, the
same set as described before for texture representation.
“Vectorizing” an input image $i_a$ means simultaneously
solving for (1) an optical flow $y^{std}_{a-std}$ that converts the
input to the shape free representation, and (2) a set of
linear prototype coefficients $\beta_j$ used to construct a model
image resembling the input. This is done by iteratively
solving

$i_a(\mathbf{x} + y^{std}_{a-std}(\mathbf{x})) = \sum_{j=1}^{N} \beta_j t_{p_j}(\mathbf{x})$

by alternating between the operations of optical flow and
projection onto the prototypes until a stable solution is
found. In this equation, $\mathbf{x}$ is a 2D point (x, y) in average
shape $y_{std}$. The iterative processing of shape and texture
is similar to the active shape models of Cootes and
Taylor [15], Cootes, et al. [16], and Lanitis, Taylor, and
Cootes [27]. Jones and Poggio [23] also describe a related
system that uses linear combinations of prototype
shapes to analyze line drawings.

Correspondence  between  two  arbitrary  images  can 
thus  be  found  by  vectorizing  both,  as  now  both  images 
are  in  correspondence  with  the  average  shape.  After  vec¬ 
torizing  both  images,  one  flow  is  inverted  (which  can  be 
done  with  negation  followed  by  a  forward  warp),  and  the 
two  flows  are  then  concatenated.  This  is  an  automatic 
technique  for  finding  interperson  correspondence. 

5.1.2  Warping  and  shape  manipulation 
operators 

2D warping operations move pixel values back and
forth between the reference shape $y_b$ and the destination
shape $y_a$. A forward warp, fwarp, pushes pixels
in the reference frame forward along the flow vectors
to the destination shape. For example, back in Fig. 4,
we can write $i_a = \mathrm{fwarp}(i_b, y^b_{a-b})$. In general, we can
push pixels along any arbitrary flow $x$, yielding the more
general form $i_{b+x} = \mathrm{fwarp}(i_b, y^b_x)$. Note that the
subscript of the image argument must match the superscript
of the shape argument, implying that the image
must be in the reference frame of the shape. Inversely, a
backwards warp, bwarp, uses the flow as an index into
the destination shape, bringing pixels in the destination
shape back to the reference. In Fig. 4, we can write
$i_b = \mathrm{bwarp}(i_a, y^b_{a-b})$.


Figure 6: (a) To change the reference frame of flow $y^b_x$
from $i_b$ to $i_a$, the x and y components are forward warped
along $y^b_{a-b}$, producing the dotted flow $y^a_x$. In (b),
backwards warping is used to compute the inverse.

Between  the  forward  and  backward  warping  opera¬ 
tions,  implementing  bwarp  is  the  easier  of  the  two.  Each 
pixel  in  the  reference  image  samples  the  destination  im¬ 
age  by  following  its  flow  vector.  Since  the  destination 
location  is  usually  between  pixels,  bilinear  interpolation 
is  used  to  produce  a  grey  level  value. 
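A backward warp with bilinear interpolation, as just described, might look like the sketch below. The flow is assumed to be stored as two arrays of per-pixel x and y displacements in the reference frame. Clamping out-of-range coordinates to the image border is an assumption of this sketch, since the text does not say how borders are handled.

```python
import numpy as np

def bilinear_sample(image, xs, ys):
    """Sample `image` at real-valued coordinates (xs, ys).

    Coordinates outside the image are clamped to the border (an
    assumption made here for simplicity).
    """
    h, w = image.shape
    xs = np.clip(xs, 0, w - 1)
    ys = np.clip(ys, 0, h - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    fx, fy = xs - x0, ys - y0
    top = (1 - fx) * image[y0, x0] + fx * image[y0, x1]
    bottom = (1 - fx) * image[y1, x0] + fx * image[y1, x1]
    return (1 - fy) * top + fy * bottom

def bwarp(dest_image, flow_x, flow_y):
    """Each reference pixel samples the destination image along its flow."""
    h, w = flow_x.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    return bilinear_sample(dest_image, xs + flow_x, ys + flow_y)

# Toy check on a linear ramp image: a half-pixel shift in x.
dest = np.arange(20.0).reshape(4, 5)
zero = np.zeros((4, 5))
shifted = bwarp(dest, zero + 0.5, zero)
```

Because every reference pixel receives exactly one interpolated value, the output has no holes, which is what makes bwarp the easier of the two operations.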

Forward warping is solved using the idea of four corner
mapping (see Wolberg [54]). Basically, we invert the
forward  warping  and  then  apply  the  backward  warping 
algorithm.  To  invert  the  forward  warping,  repeat  the 
following  for  every  square  source  patch  of  four  adjacent 
pixels  in  the  source.  Map  the  source  patch  to  a  quadrilat¬ 
eral  in  the  destination  image.  Then  for  each  destination 
pixel  inside  this  quadrilateral,  we  estimate  its  position 
inside  the  quadrilateral  treating  the  sides  of  the  quadri¬ 
lateral  as  a  warped  coordinate  system.  This  position  is 
used  to  map  to  a  location  in  the  original  source  patch. 

In  addition  to  warping  operations,  shapes  can  be  com¬ 
bined  using  binary  operations  such  as  addition  and  sub¬ 
traction.  In  adding  and  subtracting  shapes,  the  reference 
frames of both shapes must be the same, and the
subscripts of the shape arguments are added/subtracted to
yield the subscripts of the results: $y^c_{a \pm b} = y^c_a \pm y^c_b$.

The reference frame of a shape $y^b_x$ can be changed
from $i_b$ to $i_a$ by applying a forward warp with the shape
$y^b_{a-b}$. Shown pictorially in Fig. 6(a), the operation
consists of separate 2D forward warps on the x and y
components of $y^b_x$, interpreted for the moment as images
instead of vectors. Instead of pushing grey level pixels in
the forward warp, we push the x and y components of
the shape. The operation in Fig. 6(a) is denoted
$y^a_x = \text{fwarp-vect}(y^b_x, y^b_{a-b})$. The inverse operation,
shown in Fig. 6(b), is computed using two backwards warps
instead of forward ones: $y^b_x = \text{bwarp-vect}(y^a_x, y^b_{a-b})$.

Finally, two flow fields $y^a_{b-a}$ and $y^b_{c-b}$ can be
concatenated or composed to produce pixelwise
correspondences between $i_a$ and $i_c$, $y^a_{c-a}$. Concatenation is
shown pictorially in Fig. 7 and is denoted
$y^a_{c-a} = \mathrm{concat}(y^a_{b-a}, y^b_{c-b})$. The basic idea behind
implementing this operator is to put both shapes in the same
reference frame and then add. This is done by first
computing $y^a_{c-b} = \text{bwarp-vect}(y^b_{c-b}, y^a_{b-a})$ followed by

$y^a_{c-a} = y^a_{b-a} + y^a_{c-b}$.
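The two-step recipe for concatenation (re-express the second flow in the first reference frame with a backward warp, then add) can be sketched as follows. Flows are assumed to be arrays of shape (height, width, 2) holding x and y displacements. The helper names and the border clamping are assumptions of this sketch.

```python
import numpy as np

def sample_bilinear(field, xs, ys):
    """Bilinear lookup of a scalar field, clamping to the border."""
    h, w = field.shape
    xs = np.clip(xs, 0, w - 1)
    ys = np.clip(ys, 0, h - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    fx, fy = xs - x0, ys - y0
    top = (1 - fx) * field[y0, x0] + fx * field[y0, x1]
    bottom = (1 - fx) * field[y1, x0] + fx * field[y1, x1]
    return (1 - fy) * top + fy * bottom

def bwarp_vect(flow_in_b, flow_a_to_b):
    """Bring a flow defined on frame b back to frame a.

    The x and y components of `flow_in_b` are treated as images and
    sampled at the positions that `flow_a_to_b` points to.
    """
    h, w, _ = flow_a_to_b.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    px = xs + flow_a_to_b[..., 0]
    py = ys + flow_a_to_b[..., 1]
    return np.stack([sample_bilinear(flow_in_b[..., 0], px, py),
                     sample_bilinear(flow_in_b[..., 1], px, py)], axis=-1)

def concat(flow_a_to_b, flow_b_to_c):
    """Compose two flows: same reference frame first, then add."""
    return flow_a_to_b + bwarp_vect(flow_b_to_c, flow_a_to_b)

# Toy flows: a to b shifts by one pixel in x; b to c stretches in x.
h, w = 4, 6
ys, xs = np.mgrid[0:h, 0:w].astype(float)
flow_ab = np.zeros((h, w, 2))
flow_ab[..., 0] = 1.0
flow_bc = np.zeros((h, w, 2))
flow_bc[..., 0] = 0.5 * xs
flow_ac = concat(flow_ab, flow_bc)
```

In the toy case a pixel at column x in frame a lands at column x + 1 in frame b, where the second flow is 0.5 (x + 1), so the composite x displacement is 1 + 0.5 (x + 1) wherever x + 1 stays inside the image.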

Having finished our primer on shape operators, we
now describe how parallel deformation and linear classes
were used to expand the example set with virtual views.
Recognition results with these virtual views are summarized
in the next section.

Figure 7: In flow concatenation, the flows $y^a_{b-a}$ and
$y^b_{c-b}$ are composed to produce the dotted flow $y^a_{c-a}$.

Figure 8: In parallel deformation, (A) the prototype flow
$y^p_{p,r-p}$ is measured between $i_{p,r}$ and $i_p$, (B) the flow
is mapped onto the novel face $i_n$, and (C) the novel face
is 2D warped to the virtual view.

5.2  Parallel  deformation 

The  goal  of  parallel  deformation  is  to  map  a  facial  trans¬ 
formation  observed  on  a  prototype  face  onto  a  novel, 
non-prototype face. There are three steps in implementing
parallel deformation: (a) recording the deformation
$y_{p,r} - y_p$ on the prototype face, (b) mapping this
deformation onto the novel face, and (c) 2D warping the
novel  face  using  the  deformation.  We  now  go  over  these 
steps  in  more  detail,  using  as  an  example  the  prototype 
views  and  single  novel  view  in  Fig.  8. 

First, we collect prototype views $i_p$ and $i_{p,r}$ and
compute the prototype deformation

$y^p_{p,r-p} = \mathrm{vect}(i_{p,r}, i_p)$

using optical flow. Shown overlaid on the reference image
on the left of Fig. 8, this 2D deformation specifies
how to forward warp $i_p$ to $i_{p,r}$ and represents our “prior
knowledge” of face rotation. To assist the correspondence
calculation, a sequence of four frames from standard
to virtual pose is used instead of just two frames.
Pairwise optical flows are computed and concatenated to
get the composite flow from first to last frame.

Figure 9: The prototypes used for parallel deformation.
Standard poses are shown. [Panels: proto A, proto B, proto C]

Next, the 2D rotation deformation is mapped onto the
novel person’s face by changing the reference frame of
$y^p_{p,r-p}$ from $i_p$ to $i_n$. First, interperson correspondences
between $i_p$ and $i_n$ are computed

$y^p_{n-p} = \mathrm{vect}(i_n, i_p)$

and used to change the reference frame

$y^n_{p,r-p} = \text{fwarp-vect}(y^p_{p,r-p}, y^p_{n-p})$.

The flow $y^n_{p,r-p}$ is the 2D rotation deformation mapped
onto the novel person’s standard view. As the interperson
correspondences are difficult to compute, we evaluated
two techniques for establishing feature correspondence:
labeling features manually on both faces, and using
our face vectorizer (see section 5.1.1 and Beymer [9])
to automatically locate features. More will be said about
our use of these two approaches shortly.

Finally, the texture from the original real view $i_n$ is
2D warped onto the rotated face shape, producing the
final virtual view

$i_{n,r} = i_{n+(p,r-p)} = \mathrm{fwarp}(i_n, y^n_{p,r-p})$.

Referring  to  our  running  example  in  Fig.  8,  the  final 
virtual  view  is  shown  in  the  lower  right. 

In  this  procedure  for  parallel  deformation,  there  are 
two  main  parameters  that  one  may  vary: 

1.  The  prototype.  As  mentioned  previously,  the  ac¬ 
curacy  of  virtual  views  generated  by  parallel  de¬ 
formation  depends  on  the  degree  to  which  the  3D 
shape  of  the  prototype  matches  the  3D  shape  of  the 
novel  face.  Thus,  one  would  expect  different  recog¬ 
nition  results  from  different  prototypes.  We  have 
experimented  with  virtual  views  generated  using 
the  three  different  prototypes  shown  in  Fig.  9.  In 
general,  given  a  particular  novel  person,  it  is  best 
to  have  a  variety  of  prototypes  to  choose  from  and 
to  try  to  select  the  one  that  is  closest  to  the  novel 
person  in  terms  of  shape. 

2. Approach for interperson correspondence. In both
the manual and automatic approaches, interperson
correspondences are driven by the line segment features
shown in Fig. 10. The automatic segments
shown on the right were located using our face
vectorizer from Beymer [9]. The manual segments
on the left include some additional features not
returned by the vectorizer, especially around the
sides of the face. Given these sets of correspondences,
the interpolation method from Beier and
Neely [5] (see section 5.1.1) is used to interpolate
the correspondences to define a dense, pixelwise
mapping from the prototype to novel face.

Figure 10: Parallel deformation requires correspondences
between the prototype and novel person. These
correspondences are driven by the segment features
shown in the figure. The features on the left were
manually located, and the features on the right were
automatically located using the vectorizer.

Figures  11  and  12  show  example  virtual  views  gener¬ 
ated  using  prototype  A  with  the  real  view  in  the  center. 
Manual  interperson  correspondences  were  used  in  Fig.  11 
and  the  image  vectorizer  in  Fig.  12.  To  compare  views 
generated from the different prototypes, Fig. 13 shows
virtual  views  generated  from  all  three  prototypes.  For 
comparison  purposes,  the  real  view  of  each  novel  person 
is  shown  on  the  right. 

5.3  Linear  Classes 

We use the linear class idea to analyze the novel texture
in terms of the prototypes at the standard view and
reconstruct at the virtual view. In the analysis step at the
standard view, we decompose the shape free texture $t_n$ of
the novel view in terms of the N shape free prototype
views $t_{p_j}$

$t_n = \sum_{j=1}^{N} \beta_j t_{p_j}$,  (14)

which results in a set of $\beta_j$ prototype coefficients. But
before solving this equation for the $\beta_j$, the novel view $i_n$
and prototype views $i_{p_j}$ must be vectorized to produce
the geometrically normalized textures $t_n$ and $t_{p_j}$,
$1 \le j \le N$. Since the $t_{p_j}$ can be put into correspondence
manually in an off-line step (using the interpolation
technique of Beier and Neely [5]), the primary difficulty of
this step is in converting $i_n$ into its shape free
representation $t_n$. Since $i_n$ is an m4 view of the face, this step
means finding correspondence between $i_n$ and view m4’s
standard face shape. Let this standard shape be denoted
as $y_{std}$.

Our image vectorizer (Beymer [9]) is used to solve
for the correspondences $y^{std}_{n-std}$ between $i_n$ and standard
shape $y_{std}$. These correspondences can then be used to
geometrically standardize $i_n$

$t_n(\mathbf{x}) = i_n(\mathbf{x} + y^{std}_{n-std}(\mathbf{x}))$,

where $\mathbf{x}$ is an arbitrary 2D point (x, y) in standard shape.
Fig. 14 on the left shows an example view $i_n$ with some
features automatically located by the vectorizer. The
right side of the figure shows templates of the eyes,
nose, and mouth that have been geometrically normalized
using the correspondences $y^{std}_{n-std}$.

Figure 11: Example virtual views using parallel deformation.
Prototype A was used, and interperson correspondence
$y^p_{n-p}$ was specified manually.

Figure 12: Example virtual views using parallel deformation.
Prototype A was used, and interperson correspondence
$y^p_{n-p}$ was computed automatically using the
image vectorizer.

Figure 13: Example virtual views as the prototype person
is varied. The corresponding real view of each novel
person is shown on the right for comparison.

Next, the texture $t_n$ is decomposed as a linear
combination of the prototype textures, following equation
(14). First, combine the $\beta_j$ terms into a column vector
$\beta$ and define a matrix $T$ of the prototype textures, where
the jth column of $T$ is $t_{p_j}$. Then equation (14) can be
rewritten as

$t_n = T\beta$.

This can be solved using linear least squares, yielding

$\beta = T^{\dagger} t_n$,

where $T^{\dagger}$ is the pseudoinverse $(T^T T)^{-1} T^T$.

The synthesis step assumes that the textural
decomposition at the virtual view is the same as that at the
standard view. Thus, we can synthesize the virtual
texture

$t_{n,r} = \sum_{j=1}^{N} \beta_j t_{p_j,r}$,

where $t_{p_j,r}$ are the shape free prototypes that have been
warped to the standard shape of the virtual view. As
with the $t_{p_j}$’s, the $t_{p_j,r}$ are put into correspondence
manually in an off-line step. If we define a matrix $T_r$
such that column j is $t_{p_j,r}$, the analysis and synthesis
steps can be written as a linear mapping from $t_n$ to $t_{n,r}$

$t_{n,r} = T_r T^{\dagger} t_n$.

This linear mapping was previously discussed in
section 4.1 for generating virtual shapes.

Figure 14: Using correspondences from our face vectorizer,
we can geometrically normalize input $i_n$, producing
the “shape free” texture $t_n$.

Fig. 15 shows a set of virtual views generated using
the analysis of Fig. 14. Note that the prototype views
must be of the same set of people across all nine views.
We used a prototype set of 55 people, so we had to
specify manual correspondence (see Fig. 5) for 9 views of each
person to set up the shape free views. When generating
the virtual views for a particular person, we would, of
course, remove him from the prototype set if he were
initially present, following a cross validation methodology.
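The reason for removing a person from the prototype set can be seen in a small sketch. The code below uses random stand-in textures. The function name `virtual_texture` and its `exclude` parameter are hypothetical and only illustrate the leave-one-out idea.

```python
import numpy as np

def virtual_texture(T, T_r, t_n, exclude=None):
    """Analysis at the standard view, synthesis at the virtual view.

    T, T_r  : prototype textures (columns) at standard and virtual pose
    exclude : optional column index to drop, e.g. when the novel person
              is also one of the prototypes
    """
    if exclude is not None:
        keep = np.arange(T.shape[1]) != exclude
        T, T_r = T[:, keep], T_r[:, keep]
    beta, *_ = np.linalg.lstsq(T, t_n, rcond=None)
    return T_r @ beta, beta

rng = np.random.default_rng(2)
T = rng.normal(size=(40, 6))       # toy textures: 40 pixels, 6 prototypes
T_r = rng.normal(size=(40, 6))
t_n = T[:, 2].copy()               # the novel person is prototype 2

t_full, beta_full = virtual_texture(T, T_r, t_n)
t_loo, beta_loo = virtual_texture(T, T_r, t_n, exclude=2)
```

With the person left in, the fit puts all of its weight on that person's own column and simply returns his real view at the virtual pose, which would overstate how well virtual views work. Dropping the column forces the texture to be explained by the other prototypes.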

Notice  from  Fig.  15  that  by  using  the  shape  free  tex¬ 
tural  representation,  the  virtual  views  in  this  experiment 
are  decoupled  from  shape  and  hence  all  views  are  in  the 
standard  shape  of  the  virtual  pose.  The  only  difference 
between  the  views  of  different  people  at  a  fixed  pose  will 
be  their  texture. 

6  Experimental  results 

In  this  section  we  report  the  recognition  rates  obtained 
when  virtual  views  were  used  in  our  view-based  recog¬ 
nizer  [10]. 

6.1  View-based  recognizer 

In  our  view-based  face  recognizer  [10],  the  15  example 
views  of  Fig.  3  are  stored  for  each  person  to  handle  pose 
invariance.  To  recognize  an  input  view,  our  recognizer 
uses  a  strategy  of  registering  the  input  with  the  exam¬ 
ple  views  followed  by  template  matching.  To  drive  the 
registration  step  in  the  recognizer,  a  person-  and  pose- 
invariant  feature  finder  first  locates  the  irises  and  a  nose 
lobe  feature.  Similar  in  flavor  to  the  recognizer,  the  fea¬ 
ture  finder  is  template-based,  using  a  large  set  of  eyes- 
nose  templates  from  a  variety  of  “exemplar”  people  and 
the  15  example  poses. 
Figure 15: Example virtual views for linear classes.


After  feature  detection,  the  input  is  repetitively 
matched  against  all  example  views  of  all  people.  Match¬ 
ing  the  input  against  a  particular  example  view  consists 
of  two  steps,  a  geometrical  registration  step  and  corre¬ 
lation.  In  the  registration  step,  first  an  affine  transform 
is  applied  to  the  input  to  bring  the  iris  and  nose  lobe 
features  into  correspondence  with  the  same  points  on 
the  example  view.  While  this  brings  the  two  views  into 
coarse  alignment,  small  pose  or  expressional  differences 
may  remain.  To  bring  the  input  and  example  into  closer 
correspondence,  optical  flow  is  computed  between  the 
two  and  a  2D  warp  driven  by  the  flow  brings  the  two 
into  pixelwise  correspondence.  Lastly,  normalized  cor¬ 
relation  with  example  templates  of  the  eyes,  nose,  and 
mouth  is  used  to  evaluate  the  match.  The  best  match 
from the database is reported as the identified person.
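Two pieces of the matching step lend themselves to a short sketch: the affine transform fixed by three feature points (two irises and a nose lobe) and normalized correlation between a template and an image patch. The code is a generic illustration rather than the recognizer's actual implementation. The coordinates are made up, and the gradient preprocessing and optical flow refinement are omitted.

```python
import numpy as np

def affine_from_3_points(src, dst):
    """Return the 2x3 matrix M with M @ [x, y, 1] = [x', y'].

    src, dst: arrays of shape (3, 2) holding three non-collinear points
    and the positions they should be mapped to.
    """
    src_h = np.hstack([src, np.ones((3, 1))])   # homogeneous coordinates
    # Solve src_h @ M.T = dst for the six affine parameters.
    return np.linalg.solve(src_h, dst).T

def normalized_correlation(a, b):
    """Zero-mean normalized correlation of two equally sized arrays.

    Assumes neither array is constant, so the denominator is nonzero.
    """
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

# Hypothetical feature locations: left iris, right iris, nose lobe.
src = np.array([[10.0, 10.0], [30.0, 10.0], [20.0, 25.0]])
M_true = np.array([[1.1, 0.1, 2.0], [-0.1, 0.9, 3.0]])
dst = np.hstack([src, np.ones((3, 1))]) @ M_true.T
M = affine_from_3_points(src, dst)

template = np.random.default_rng(3).normal(size=(8, 8))
score_same = normalized_correlation(template, 2.0 * template + 5.0)
score_flip = normalized_correlation(template, -template)
```

Three point pairs determine the six affine parameters exactly, which is why the irises and a nose lobe feature are enough for coarse alignment. Normalized correlation is unchanged by a gain and offset in grey level, as the `score_same` case shows.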

6.2  Recognition  results 

To  test  the  recognizer,  a  set  of  10  testing  views  per  per¬ 
son  were  taken  to  randomly  sample  poses  within  the 
overall  range  of  poses  in  Fig.  3.  Roughly  half  of  the  test 
views  include  an  image-plane  rotation,  so  all  three  rota¬ 
tional  degrees  of  freedom  are  tested.  There  are  62  people 
in  the  database,  including  44  males  and  18  females,  peo¬ 
ple  from  different  races,  and  an  age  range  from  the  20s 
to  the  40s.  Lighting  for  all  views  is  frontal  and  facial 
expression  is  neutral. 

Table  1  shows  recognition  rates  for  parallel  deforma¬ 
tion  for  the  different  prototypes  and  for  manual  vs.  auto¬ 
matic  features.  As  with  the  experiments  with  real  views 
in  Beymer  [10],  the  recognition  rates  were  recorded  for 
a  forced  choice  scenario  -  the  recognizer  always  reports 
the  best  match.  In  the  template-based  recognizer,  tem¬ 
plate  scale  was  fixed  at  an  intermediate  scale  (interoc¬ 
ular  distance  =  30  pixels)  and  preprocessing  was  fixed 
at  dx-hdy  (the  sum  of  separate  correlations  on  the  x 
and  y  components  of  the  gradient).  These  parameters 
had yielded the best recognition rates for real views in
Beymer [10]. The results were fairly consistent, with a
mean recognition rate of 84.1% and a standard deviation
of only 0.6%. Automatic feature correspondence on
average was as good as the manual correspondences, which
was a good result for the face vectorizer. In the manual
case, though, it is important to note that the manual
step is at “model-building” time; the face recognizer at
run time is still completely automatic.

interperson               prototype
correspondence       A        B        C
manual             84.5%    83.9%    83.9%
auto               85.2%    84.0%    83.4%

Table 1: Recognition rates for parallel deformation for
the different prototypes and for manual vs. automatic
features.
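
The summary statistics quoted for Table 1 can be checked with a few lines of Python. This is only a sketch: the paper does not say whether the population or the sample form of the standard deviation was used, so the population form below is an assumption (both forms round to 0.6 here), and the variable names are hypothetical.

```python
import statistics

# Recognition rates (percent) from Table 1, by correspondence type and prototype.
rates = {
    "manual": {"A": 84.5, "B": 83.9, "C": 83.9},
    "auto": {"A": 85.2, "B": 84.0, "C": 83.4},
}

all_rates = [r for row in rates.values() for r in row.values()]
mean_rate = statistics.mean(all_rates)
std_rate = statistics.pstdev(all_rates)  # assumption: population form

# Per-row means compare manual and automatic correspondence.
row_means = {name: statistics.mean(row.values()) for name, row in rates.items()}
```

The two row means differ by about a tenth of a percentage point, which is the sense in which automatic correspondence was on average as good as manual correspondence.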

Fig.  16  summarizes  our  experiments  with  using  real 
and  virtual  views  in  the  recognizer.  Starting  on  the 
right,  we  repeat  the  result  from  Beymer  [10]  where  we 
use  15  real  views  per  person.  This  recognition  rate  of 
98.7%  presents  a  “best  case”  scenario  for  virtual  views. 
The  real  views  case  is  followed  by  parallel  deformation, 
which  gives  a  recognition  rate  of  85.2%  for  prototype  A 
and  automatic  interperson  correspondences.  Next,  lin¬ 
ear  classes  on  texture  yields  a  recognition  rate  of  73.5%. 
To  put  these  two  recognition  numbers  in  context,  we 
compare  them  to  a  “base”  case  that  uses  only  two  ex¬ 
ample  views  per  person,  the  real  view  m4  plus  its  mirror 
reflection.  A  recognition  rate  of  70%  was  obtained  for 
this  two  view  case,  thus  establishing  a  lower  bound  for 
virtual views. Parallel deformation at 85% falls midway
between the benchmark cases of 70% (one view + mirror
reflect.) and 98% (15 views), so it shows that virtual
views do benefit pose-invariant face recognition.

In  addition,  the  leftmost  bar  in  Fig.  16  (one  view) 
gives  the  recognition  rate  when  only  the  view  m4  is  used. 
This  shows  how  much  using  mirror  reflection  helps  in  the 
single  real  view  case:  without  the  view  generated  by  mir¬ 
ror  reflection,  the  recognition  rate  is  roughly  cut  in  half 
from  70%  to  32%.  This  low  recognition  rate  is  caused 
by  winnowing  of  example  views  based  on  the  coarse  pose 
estimate  (looking  left  vs.  looking  right)  of  the  input.  If 
the input view is “looking right”, then the system does
not  even  try  to  match  against  the  m4  example  view, 
which  is  “looking  left”.  In  this  (one  view)  case,  62%  of 
the  inputs  are  rejected,  and  6%  of  the  inputs  give  rise  to 
substitution  errors. 
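
The effect of winnowing on the one-view database can be sketched as follows. This is a toy illustration under stated assumptions: views are small lists of pixel rows, the coarse pose is reduced to the labels "left" and "right", and `mirror` and `candidate_views` are hypothetical helper names.

```python
def mirror(view):
    """Reflect a view (a list of pixel rows) about the vertical axis."""
    return [list(reversed(row)) for row in view]

def candidate_views(examples, input_side):
    """Winnow examples: keep views whose coarse pose matches the input's."""
    return [view for side, view in examples if side == input_side]

m4 = [[1, 2, 3], [4, 5, 6]]  # toy stand-in for the single real view
one_view = [("left", m4)]
with_mirror = one_view + [("right", mirror(m4))]

# A right-looking input finds no candidates in the one-view database,
# so it is rejected before any correlation is attempted.
rejected = candidate_views(one_view, "right") == []

# With the mirrored view added, the same input has a candidate to match.
matched = candidate_views(with_mirror, "right")

# Accounting for the one-view case, in percent of test inputs:
# everything that is neither rejected nor a substitution error is correct.
correct = 100 - 62 - 6
```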

Linear  classes  for  virtual  texture  was  a  disappoint¬ 
ment,  however,  only  yielding  a  recognition  rate  a  few 
percentage  points  higher  than  the  base  case  of  70%.  This 
may  have  been  due  to  the  factoring  out  of  shape  informa¬ 
tion.  We  also  noticed  that  the  linear  reconstruction  has 
a  “smoothing”  effect,  reproducing  the  lower  frequency 
components  of  the  face  better  than  the  higher  frequency 
ones.  One  difference  in  the  experimental  test  conditions 
with respect to parallel deformation was that correlation
was performed on the original grey levels instead
of dx+dy; empirically we obtained much worse performance
after applying a differential operator.


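[Figure 16: bar chart of recognition rate; bars, left to right:
one view; one view + mirror reflect.; linear classes (texture);
parallel deform.; real views.]

Figure 16: Face recognition performance for real and
virtual views.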


7  Discussion 

7.1  Evaluation  of  recognition  rate 

While  the  recognition  rate  using  virtual  views,  rang¬ 
ing  from  85%  for  parallel  deformation  to  73%  for  linear 
classes,  is  much  lower  than  the  98%  rate  for  the  multi¬ 
ple  views  case,  this  was  expected  since  virtual  views  use 
much  less  information.  One  way  to  evaluate  these  rates 
is  to  use  human  performance  as  a  benchmark.  To  test 
human  performance,  one  would  provide  a  subject  with  a 
set  of  training  images  of  previously  unknown  people,  us¬ 
ing  only  one  image  per  person.  After  studying  the  train¬ 
ing  images,  the  subject  would  be  asked  to  identify  new 
images  of  the  people  under  a  variety  of  poses.  Moses,  Ull- 
man,  and  Edelman  [32]  have  performed  this  experiment 
using  testing  views  at  a  variety  of  poses  and  lighting 
conditions.  While  high  recognition  rates  were  observed 
in  the  subjects  (97%),  the  subjects  were  only  asked  to 
discriminate  between  three  different  people.  Bruce  [12] 
performs  a  similar  experiment  where  the  subject  is  asked 
whether  a  face  had  appeared  during  training,  and  detec¬ 
tion  rates  go  down  to  either  76%  or  60%,  depending  on 
the  amount  of  pose/expression  difference  between  the 
testing and training views. Schyns and Bülthoff [41] obtain
a low recognition rate, but their results are difficult
to  compare  since  their  stimuli  are  Gouraud  shaded  3D 
faces  that  exclude  texture  information.  Lando  and  Edel¬ 
man  [26]  have  recently  performed  computational  exper¬ 
iments  to  replicate  earlier  psychophysical  results  in  [32]. 
A  recognition  rate  of  only  76%  was  reported,  but  the 
authors  suggest  that  this  may  be  improved  by  using  a 
two-stage  classifier  instead  of  a  single-stage  one. 

Direct comparison of our results to related face recognition
systems is difficult because of differences in example
and testing views. The closest systems are those of
Lando  and  Edelman  [26]  and  Maurer  and  von  der  Mals- 
burg  [31].  Both  systems  explore  a  view  transformation 
method  that  effectively  generates  new  views  from  a  sin¬ 
gle  view.  The  view  representation,  in  contrast  to  our 
template-based  approach,  is  feature-based:  Lando  and 
Edelman  use  difference  of  Gaussian  features,  and  Mau¬ 
rer  and  von  der  Malsburg  use  a  set  of  Gabor  filters  at  a 
variety of scales and rotations (called “jets”). The prior
knowledge Lando and Edelman used to transform faces
is similar to ours: views of prototype faces at standard
and virtual poses. They average the transformation in
feature  space  over  the  prototypes  and  apply  this  aver¬ 
age  transformation  to  a  novel  object  to  produce  a  “vir¬ 
tual”  set  of  features.  As  mentioned  above,  they  report  a 
recognition  rate  of  76%.  Maurer  and  von  der  Malsburg 
transform  their  Gabor  jet  features  by  approximating  the 
facial  surface  at  each  feature  point  as  a  plane  and  then 
estimating  how  the  Gabor  jet  changes  as  the  plane  ro¬ 
tates  in  3D.  They  apply  this  technique  to  rotating  faces 
about  45°  between  frontal  and  half-profile  views.  They 
report  a  recognition  rate  of  53%  on  a  subset  of  90  people 
from  the  FERET  database. 

Two other comparable results are from Manjunath, et
al. [30], who obtain 86% on a database of 86 people, and
Pentland, et al. [34], whose extrapolation experiment
with  view-based  eigenspaces  yields  83%  on  a  database  of 
21  people.  In  both  cases,  the  system  is  trained  on  a  set  of 
views  (vs.  just  one  for  ours)  and  recognition  performance 
is  tested  on  views  from  outside  the  pose-expression  space 
of  the  training  set.  One  difference  in  example  views  is 
that  they  include  hair  and  we  do  not.  In  the  future,  the 
new  Army  FERET  database  should  provide  a  common 
benchmark  for  comparing  recognition  algorithms. 

7.2  Difficulties  with  virtual  views  generation 

Since  we  know  that  the  view-based  approach  performs 
well  with  real  example  views,  making  the  virtual  views 
closer  in  appearance  to  the  “true”  rotated  views  would 
obviously  improve  recognition  performance.  What  dif¬ 
ficulties  do  we  encounter  in  generating  “true”  virtual 
views?  First,  the  parallel  deformation  approach  for 
shape  essentially  approximates  the  3D  shape  of  the  novel 
person  with  the  3D  shape  of  the  prototype.  If  the  two  3D 
shapes  are  different,  the  virtual  view  will  not  be  “true” 
even  though  it  may  still  appear  to  be  a  valid  face.  The 
resulting  shape  is  a  mixture  of  the  novel  and  prototype 
shapes.  Using  multiple  prototypes  and  the  linear  class 
approach  may  provide  a  better  shape  approximation. 

In  addition,  for  parallel  deformation  we  have  prob¬ 
lems  with  areas  that  are  visible  in  the  virtual  view  but 
not  in  the  standard  view.  For  example,  for  the  m4  pose, 
the  underside  of  the  nose  is  often  not  visible.  How  can 
one  predict  how  that  region  appears  for  upward  looking 
virtual  views?  Possible  ways  to  address  this  problem  in¬ 
clude  using  additional  real  views  or  having  the  recognizer 
exclude  those  regions  during  matching. 

7.3  Transformations  besides  rotation 

While  the  theory  and  recognition  experiments  in  this 
paper revolve around generating rotated virtual views,
one may also wish to generate virtual views for different
lighting conditions or expressions. This would be useful
for building a view-based face recognizer that handles
those kinds of variation in the input. Here we suggest
ways to generate these views.

7.3.1  Lighting 

For changes in lighting conditions, the prototype faces
are fixed in pose but the position of the light source is
changed between the standard and virtual views. Unfortunately,
changing the direction of the light source
violates an assumption made for linear classes that the
lighting conditions are fixed. That assumption had allowed
us to ignore the fact that surface albedo and the
local surface normal are confounded in the Lambertian
model for image intensity.

However, the idea of parallel deformation can still be
applied. Parallel deformation assumes that the 3D shape
of the prototype is similar to the 3D shape of the novel
person. Thus, corresponding points on the two faces
should have the same local surface normal. The following
analysis focuses on the image brightness of the same
feature point on both the prototype and novel face. The
two feature points may have been brought into correspondence
through a vectorization procedure. Let

$\eta$           surface normal for both the prototype and novel faces
$l_{std}$        light source direction for standard lighting
$l_{virtual}$    light source direction for virtual lighting
$\rho_{proto}$   albedo for the prototype face
$\rho_{nov}$     albedo for the novel face

The prior knowledge of the lighting transformation can
be represented by the ratio of the prototype image intensities
under the two lighting directions

$$ \frac{\rho_{proto} \, (\eta \cdot l_{virtual})}{\rho_{proto} \, (\eta \cdot l_{std})}. $$

Simply by multiplying by the image intensity of the novel
person $\rho_{nov} (\eta \cdot l_{std})$ and cancelling terms, one can get

$$ \rho_{nov} \, (\eta \cdot l_{virtual}), $$

which is the image intensity of the novel feature point
under the virtual lighting. Overall, the novel face texture
is modulated by the changes in the prototype lighting,
an approach that has been explored by Brunelli [13].
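
The cancellation argument can be checked numerically with a small sketch. It assumes a Lambertian model with attached shadows clamped to zero, textures stored as flat lists of corresponding points, and a small `eps` guard against division by zero in dark prototype pixels; the clamp, the guard, and the function names are illustrative assumptions rather than part of the method described above.

```python
def lambertian(albedo, normals, light):
    """Lambertian intensity per point: albedo * max(0, normal . light)."""
    return [a * max(0.0, sum(n * l for n, l in zip(normal, light)))
            for a, normal in zip(albedo, normals)]

def relight(novel_std, proto_std, proto_virtual, eps=1e-6):
    """Modulate the novel texture by the prototype's lighting ratio."""
    out = []
    for i_nov, p_std, p_virt in zip(novel_std, proto_std, proto_virtual):
        ratio = p_virt / max(p_std, eps)  # prototype albedo cancels here
        out.append(i_nov * ratio)
    return out

# Shared surface normals (the parallel deformation assumption).
normals = [(0.0, 0.0, 1.0), (0.6, 0.0, 0.8), (0.0, 0.6, 0.8)]
l_std, l_virtual = (0.0, 0.0, 1.0), (0.6, 0.0, 0.8)
rho_proto, rho_nov = [0.5, 0.7, 0.9], [0.3, 0.6, 0.2]

predicted = relight(lambertian(rho_nov, normals, l_std),
                    lambertian(rho_proto, normals, l_std),
                    lambertian(rho_proto, normals, l_virtual))
expected = lambertian(rho_nov, normals, l_virtual)
```

When the two faces really do share surface normals, `predicted` agrees with `expected` up to rounding; any difference in 3D shape between prototype and novel face would show up as an error in the relit texture.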


7.3.2  Expression 

In this case, the prototypes are fixed in pose and lighting
but differ in expression, with the standard view being,
say, a neutral expression and the virtual view being
a smile, frown, etc. When generating virtual views, we
need to capture both nonrigid shape deformations and
the subtle texture changes such as the darkening effect
of dimples or wrinkles. Thus, virtual views generation
techniques for both shape and texture are required.

Predicting virtual expressions, however, seems more
difficult than the rotation or lighting case. This is because
the way a person smiles or frowns is probably decoupled
from how to decompose his neutral face as a
linear combination of the prototypes. To the extent that
they are decoupled, the approaches we have suggested
for generating virtual shapes and textures will be an approximation.
Our problems show up mathematically in
the nonrigidness of the transformation; the linear class
idea for shape assumes a rigid 3D transform. The implication
of these problems is that the expense of multiple
prototypes is probably not justified; one is probably
better off using just one or a few prototypes. In earlier
work aimed primarily at computer graphics [8], we
demonstrated parallel deformation for transformations
from neutral to smiling expressions.

7.4  Future  work 

For future work on our approach to virtual views, we
plan to use multiple prototypes for generating virtual
shape. Vetter and Poggio [51] have already done some
work in applying the linear class idea to both shape and
texture. It would be interesting to test some of their
virtual views in a view-based recognizer. In the longer
term, one can test the virtual views technique for face
recognition under different lighting conditions or expressions.

For the problem of recognizing faces from just one example
view, it should be possible to use the idea of linear
classes without actually synthesizing virtual views. The
basic idea is to compare faces based on sets of
coefficients from equations (5) and (9) rather than using
correlation in an image space. According to linear
classes, the decomposition for a specific individual
should be invariant to pose. As explained in section
4.1, linear classes is based on the assumption that
the 3D shape vector of the input Y and the 3D texture
vector T are linear combinations of the shapes and textures
of prototype faces. Under certain conditions, the
linear coefficients of the 3D decomposition are
computable from an arbitrary 2D view. Thus, the coefficients
should be invariant to pose since they are derived
from a 3D representation. It follows that the
coefficients should themselves be an effective representation
for faces. The coefficients of the unidentified input
view can be directly matched against the
data base coefficients $(\alpha_j, \beta_j)$ of each person at
standard pose. Note that the linear coefficients are not a
true invariant because the recognizer at run-time needs
to have an estimate of the out-of-plane image rotation
of the input.
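
A minimal sketch of matching in coefficient space follows. It assumes that the shape and texture coefficients of each person have already been computed and concatenated into one list, and it uses Euclidean distance as the comparison; the distance measure and the name `match_by_coefficients` are assumptions, since the text does not specify how coefficient sets would be compared.

```python
import math

def match_by_coefficients(input_coeffs, database):
    """Return the person whose stored coefficients are nearest the input's.

    `database` maps a person's name to the coefficient list computed from
    that person's single view at standard pose.
    """
    return min(database,
               key=lambda name: math.dist(input_coeffs, database[name]))

# Hypothetical coefficient vectors (two shape + two texture coefficients).
database = {
    "person_1": [0.7, 0.3, 0.1, 0.9],
    "person_2": [0.2, 0.8, 0.5, 0.5],
}
identity = match_by_coefficients([0.65, 0.35, 0.15, 0.85], database)
```

Compared with correlation in image space, this comparison involves only as many numbers as there are prototypes, at the cost of relying on the linear class assumption holding for the input.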


8  Conclusion 
In this paper we have addressed the problem of recognizing
faces under different poses when only one example
view of each person is available. Given one real view at
a known pose, we use prior knowledge of faces to generate
virtual views, views of the face as seen from different
poses. Rather than using a more traditional 3D modeling
approach, prior knowledge of faces is expressed in
the form of 2D views of rotating prototype faces. Given
the 2D prototype views and a single real view of a novel
person, we demonstrated two techniques for effectively
rotating the novel face in depth. First, in parallel deformation,
a facial transformation observed on a prototype
face is mapped onto a novel face and used to warp the
novel view. Second, in linear classes, the single novel
view is decomposed as a linear combination of prototype
views at the same pose. Then these same linear coefficients
are used to synthesize a virtual view of the novel
person by taking a linear combination of the prototype
views at virtual pose. We demonstrated this for the grey
level, or textural, component of the face.

To evaluate virtual views, they were then used as example
views in a view-based, pose-invariant face recognizer.
On a database of 62 people with 10 test views per
person, a recognition rate of 85% was achieved in experiments
with parallel deformation, which is well above the
base recognition rate of 70% when only one real view
(plus its mirror reflection) is used. Also, our recognition
rate is similar to other face recognition experiments
where extrapolation from the pose-expression range of
the example views is tested. Overall, for the problem of
generating new views of an object from just one view,
these results demonstrate that the 2D example-based
technique, similarly to 3D object models, may be a viable
method for representing knowledge of object classes.
A  Appendix:  how  general  is  the  linear
class  assumption?

Following the basic suggestion in Poggio (1991) and Poggio
and Vetter (1992) we consider the problem of creating
virtual views within the framework of learning-from-examples
techniques. In this metaphor, a learning module
such as a Regularization Network is trained with a set
of input-output examples of objects of the same “nice”
class (see Vetter et al., 1995) to learn a certain class-specific
transformation. For a rotation transformation
“frontal” views are associated with the “rotated” views
of the same faces. For the “smile” transformation “serious”
views are associated with “smiling” views. One can
think of the frontal view as the input to the network and
the rotated view as the corresponding output during the
training phase.

One may be inclined to believe that this approach
may be more powerful than the simple linear technique
described in the text and justified under the linear
class assumption of Poggio and Vetter. Though this is
true, the behavior of a generic network trained as described
above is still severely restricted. It turns out that
under rather general conditions the output of a very large
class of learning-from-examples techniques produces virtual
views that are contained in the linear space spanned
by the (output) examples.

One way to express the result is the following. If the
transformed view of an object cannot be represented as
a linear combination of transformed views of prototypes
(in smaller number than view dimensionality) then no
learning module (under quite weak conditions) can learn
that transformation from example pairs.

The basic result is implied by Girosi, Jones and Poggio
(1995) and by Beymer, Shashua and Poggio (1993). We
reproduce it here for completeness.

The simplest version of a regularization network approximates
a vector field y(x) as

$$ y(x) = \sum_{i=1}^{N} c_i \, G(x - x_i), \qquad (15) $$

which we rewrite in matrix terms as

$$ y(x) = C \, g(x), \qquad (16) $$

where g is the vector with elements $g_i = G(x - x_i)$.
Defining G as the matrix of the chosen basis function
($G_{ij} = G(x_i - x_j)$) evaluated at the examples we obtain

$$ G \, c_m = y_m \qquad (17) $$

and also

$$ C = Y G^{+}. \qquad (18) $$

It follows that the vector field is approximated as the
linear combination of example fields, that is

$$ y(x) = Y G^{+} g(x), \qquad (19) $$

that is

$$ y(x) = \sum_{l=1}^{N} b_l(x) \, y_l, \qquad (20) $$

where the $b_l$ depend on the chosen G, according to

$$ b(x) = G^{+} g(x). \qquad (21) $$

Thus for any choice of the regularization network the
output (vector) image is always a linear combination of
example (vector) images with coefficients b that depend
(nonlinearly) on the desired input value. This is true
for a large class of networks trained with the $L_2$ error
criterion, including many types of Neural Networks (the
observation is by F. Girosi).
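
The derivation can be illustrated with a small numerical sketch. It assumes a Gaussian basis function of fixed width and distinct example inputs, so that G is invertible and its inverse can stand in for the pseudoinverse; the basis choice, the plain Gaussian-elimination solver, and all function names are assumptions made for illustration.

```python
import math

def gaussian(r2, sigma=1.0):
    """Gaussian basis function of the squared distance r2."""
    return math.exp(-r2 / (2.0 * sigma ** 2))

def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [list(row) + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def network_output(x, inputs, outputs):
    """Return (y(x), b(x)), where y(x) = sum_l b_l(x) * outputs[l]."""
    G = [[gaussian(sqdist(xi, xj)) for xj in inputs] for xi in inputs]
    g = [gaussian(sqdist(x, xi)) for xi in inputs]
    b = solve(G, g)  # b(x) = G^{-1} g(x), assuming G is invertible
    dim = len(outputs[0])
    y = [sum(b[l] * outputs[l][k] for l in range(len(outputs)))
         for k in range(dim)]
    return y, b

# Three hypothetical example pairs: 1D inputs, 2D output "images".
inputs = [[0.0], [1.0], [2.5]]
outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
y_at_example, b_at_example = network_output([1.0], inputs, outputs)
y_new, b_new = network_output([0.4], inputs, outputs)
```

At an example input the weights b(x) pick out that example's output, and at any other input the output is still a weighted sum of the stored example outputs, with weights that depend nonlinearly on x.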

Thus, the virtual views that can be generated by a
large class of learning techniques are always contained
in the linear subspace of the examples. This nontrivial
observation means that the latter property is rather general
and does not depend on the linear class assumption
of Poggio and Vetter.

Notice that for these results to hold we assume correspondence
between all input vectors and all output
vectors separately. Strictly speaking correspondence is
not needed between input and output vectors (this observation
is due to T. Vetter).

Let us define separately the shape component of our
images and the texture components. The shape is a vector
y of all $x_1, y_1, \cdots, x_n, y_n$ describing the image position
of each of n pixels. We consider separately the
vector of corresponding textures for the same n pixels
as t consisting of the grey values $I_1, \cdots, I_n$. We could
also consider the extended vector E obtained by concatenating
y and t. The previous result shows that the
output rotated shape image obtained for a new frontal
image is in the linear space spanned by the examples.
Thus

$$ y_r = \sum_{i} c_i \, y_r^{i}. \qquad (22) $$

Of course in general the c are not identical to the
coefficients of the input representation (as in the linear
class case described in the text) and will be a nonlinear
function of them. In the linear class case the c can be
learned by a simple linear network without hidden layers
(see main text).


Since the linear class argument can also be extended
to the texture component t (see main text and Vetter
and Poggio, 1995), the same observations stated here for
shape can also be applied to texture.

One could impose linear class conditions on E: the
dimensionality of the space will however be considerably larger.
In general, one should keep the shape and the
texture components separate and span them with independent
bases.

Notice (see also Beymer, Poggio and Shashua, 1993)
that in many situations it may be advantageous to transform
the output representation from

$$ y_r = \sum_{i=1}^{n} c_i \, y_r^{i} \qquad (23) $$

to

$$ y_r = \sum_{i=1}^{q} c_i^{*} \, y_i^{*}, \qquad (24) $$

where the $y^{*}$ are the basis of a KL decomposition and
$q \ll n$.

B  Appendix:  linear  classes 

As explained in section 4.1, linear classes is a technique
for synthesizing new views of an object using views of
prototypical objects belonging to the same object class.
The basic idea is to decompose the novel object as a
linear combination of the prototype objects. This decomposition
is performed separately for the shape and
texture of the novel object. In this appendix, we explain
the mathematical detail behind the linear class approach
for shape and texture. Please refer to sections 3 and 4.1
for definitions of the example prototype images, mathematical
operators, etc.

B.1  Shape  (Poggio  and  Vetter,  1992)

In this section, we reformulate the description of linear
classes for shape that originally appeared in Poggio and
Vetter [39]. The development here makes explicit the
fact that the vectorized y vectors need not be in correspondence
between the standard and virtual poses.

Linear classes begins with the assumption that a novel
object is a linear combination of a set of prototype objects
in 3D

$$ Y_n = \sum_{j=1}^{N} \alpha_j Y_{p_j}. \qquad (25) $$

From this assumption, it is easy to see that any 2D view
of the novel object will be the same linear combination
of the corresponding 2D views of the prototypes. That
is, the 3D linear decomposition is the same as the 2D
linear decomposition. Using equation (2) which relates
3D and 2D shape vectors, let $y_{n,r}$ be a 2D view of a
novel object

$$ y_{n,r} = L Y_n \qquad (26) $$

and let $y_{p_j,r}$ be 2D views of the prototypes

$$ y_{p_j,r} = L Y_{p_j}, \quad 1 \le j \le N. \qquad (27) $$

Apply the operator L to both sides of equation (25)

$$ L Y_n = L \Big( \sum_{j=1}^{N} \alpha_j Y_{p_j} \Big). \qquad (28) $$

We can bring L inside the sum since L is linear

$$ L Y_n = \sum_{j=1}^{N} \alpha_j L Y_{p_j}. \qquad (29) $$

Substituting equations (26) and (27) yields

$$ y_{n,r} = \sum_{j=1}^{N} \alpha_j y_{p_j,r}. $$

Thus, the 2D linear decomposition uses the same set of
linear coefficients as with the 3D vectorization.

Next, we show that under certain assumptions, the
novel object can be analyzed at standard pose and the
virtual view synthesized at virtual pose using a single set
of linear coefficients. Again, assume that a novel object
is a linear combination of a set of prototype objects in
3D

$$ Y_n = \sum_{j=1}^{N} \alpha_j Y_{p_j}. \qquad (30) $$

Say that we have 2D views of the prototypes at standard
pose $y_{p_j}$, 2D views of the prototypes at virtual pose
$y_{p_j,r}$, and a 2D view of the novel object $y_n$ at standard
pose. Additionally, assume that the 2D views $y_{p_j}$ are
linearly independent. Project both sides of equation (30)
using the rotation for standard pose, yielding

$$ y_n = \sum_{j=1}^{N} \alpha_j y_{p_j}. $$

A unique solution for the $\alpha_j$ exists since the $y_{p_j}$ are linearly
independent. Now, since we have solved for the
same set of coefficients in the 3D linear class assumption,
the decomposition at virtual pose must use the same coefficients

$$ y_{n,r} = \sum_{j=1}^{N} \alpha_j y_{p_j,r}. $$

That is, we can recover the $\alpha_j$’s from the view at standard
pose and use the $\alpha_j$’s to generate the virtual view
of the novel object.
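
The argument can be verified on synthetic data with a short sketch. It assumes that the operator L is a rotation about the vertical axis followed by orthographic projection, that there are two prototypes with three feature points each, and that the coefficients are found by solving the least-squares normal equations; the pose angles, the toy shapes, and the function names are all hypothetical.

```python
import math

def project(shape3d, angle):
    """Rotate 3D points about the vertical axis, then drop z.

    Plays the role of the linear operator L; returns the flat 2D shape
    vector [x1, y1, ..., xn, yn].
    """
    c, s = math.cos(angle), math.sin(angle)
    view = []
    for x, y, z in shape3d:
        view.extend([c * x + s * z, y])
    return view

def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [list(row) + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def coefficients(novel_view, proto_views):
    """Least-squares alpha with novel_view = sum_j alpha_j * proto_views[j]."""
    gram = [[sum(a * b for a, b in zip(p, q)) for q in proto_views]
            for p in proto_views]
    rhs = [sum(a * b for a, b in zip(p, novel_view)) for p in proto_views]
    return solve(gram, rhs)

def synthesize(alpha, proto_views):
    """Linear combination of prototype views with coefficients alpha."""
    return [sum(a * p[k] for a, p in zip(alpha, proto_views))
            for k in range(len(proto_views[0]))]

protos3d = [
    [(0.0, 0.0, 1.0), (1.0, 0.0, 0.5), (0.0, 1.0, 0.2)],
    [(0.2, 0.1, 1.5), (1.2, -0.1, 0.3), (-0.1, 1.1, 0.6)],
]
true_alpha = [0.3, 0.7]
novel3d = [tuple(sum(a * p[i][k] for a, p in zip(true_alpha, protos3d))
                 for k in range(3)) for i in range(3)]

std, virt = math.radians(-20), math.radians(20)  # hypothetical poses
alpha = coefficients(project(novel3d, std), [project(p, std) for p in protos3d])
virtual_view = synthesize(alpha, [project(p, virt) for p in protos3d])
true_view = project(novel3d, virt)
```

Because the novel shape is built as an exact linear combination of the prototypes, the coefficients recovered at standard pose reproduce the true rotated view; for a real face the linear class assumption holds only approximately, and so would the synthesized view.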

B.2  Texture 

Virtually the same argument can be applied to the geometrically
normalized texture vectors t. The idea of
applying linear classes to texture was thought of by the
authors and independently by Vetter and Poggio [51].

With the texture case, assume that a novel object
texture is a linear combination of a set of prototype
textures

$$ T_n = \sum_{j=1}^{N} \beta_j T_{p_j}. \qquad (31) $$

As with shape, we show that the 3D linear decomposition
is the same as the 2D linear decomposition. Using
equation (3) which relates 3D and 2D texture vectors,
let $t_{n,r}$ be a 2D texture of a novel object

$$ t_{n,r} = D T_n \qquad (32) $$

and let $t_{p_j,r}$ be 2D textures of the prototypes

$$ t_{p_j,r} = D T_{p_j}, \quad 1 \le j \le N. \qquad (33) $$

Apply the operator D to both sides of equation (31)

$$ D T_n = D \Big( \sum_{j=1}^{N} \beta_j T_{p_j} \Big). \qquad (34) $$

We can bring D inside the sum since D is linear

$$ D T_n = \sum_{j=1}^{N} \beta_j D T_{p_j}. \qquad (35) $$

Substituting equations (32) and (33) yields

$$ t_{n,r} = \sum_{j=1}^{N} \beta_j t_{p_j,r}. $$

Thus, as with shape, the 2D linear decomposition for
texture uses the same set of linear coefficients as with
the 3D vectorization.

Next, we show that under certain linear independence
assumptions, the novel object texture can be analyzed at
standard pose and the virtual view synthesized at virtual
pose using a single set of linear coefficients. Again,
assume that a novel object texture T is a linear combination
of a set of prototype objects

$$ T_n = \sum_{j=1}^{N} \beta_j T_{p_j}. \qquad (36) $$

Say that we have 2D textures of the prototypes at standard
pose $t_{p_j}$, the 2D prototype textures at virtual pose
$t_{p_j,r}$, and a 2D texture of the novel object $t_n$ at standard
pose. Additionally, assume that the 2D textures $t_{p_j}$
are linearly independent. Project both sides of equation
(36) using the rotation for standard pose, yielding

$$ t_n = \sum_{j=1}^{N} \beta_j t_{p_j}. $$

A unique solution for the $\beta_j$ exists since the $t_{p_j}$ are linearly
independent. Now, since we have solved for the
same set of coefficients in the 3D linear class assumption,
the decomposition at virtual pose must use the same coefficients

$$ t_{n,r} = \sum_{j=1}^{N} \beta_j t_{p_j,r}. $$

That is, we can recover the $\beta_j$’s from the view at standard
pose and use the $\beta_j$’s to generate the virtual view
of the novel object.